python - Scraping a website with Beautiful Soup, but it can't scrape every item
Problem description
I'm a complete beginner and I've run into a problem with web scraping. I'm able to scrape the image, title, and price, and scraping index [0] works fine.
However, whenever I try to run a loop, or hard-code an index higher than 0, it says the index is out of range, and it won't scrape any of the other <li>
tags. Is there another way to solve this? I've also added Selenium so that the whole page loads. Any help would be greatly appreciated.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Raw string so the backslashes in the Windows path aren't treated as escapes
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://ca.octobersveryown.com/collections/all")

scrolls = 22
while True:
    scrolls -= 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(0.2)
    if scrolls < 0:
        break

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
bodies = soup.find(id='content')
clothing = bodies.find_all('ul', class_='grid--full product-grid-items')

for span_tag in soup.findAll(class_='visually-hidden'):
    span_tag.replace_with('')

print(clothing[0].find('img')['src'])
print(clothing[0].find(class_='product-title').get_text())
print(clothing[0].find(class_='grid-price-money').get_text())

time.sleep(8)
driver.quit()
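A likely cause of the "list index out of range" error: `find_all('ul', class_='grid--full product-grid-items')` matches the product grid `<ul>` itself, and the page has only one such grid, so `clothing` contains exactly one element and `clothing[1]` fails. The individual products are the `<li>` children of that single grid, so the loop should iterate over those instead. A minimal sketch of the difference (the HTML here is a made-up fragment mimicking the page's structure, not the live markup):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the product grid on the page.
html = """
<div id="content">
  <ul class="grid--full product-grid-items">
    <li><img src="//cdn.example/a.jpg">
        <p class="product-title">ITEM A</p>
        <span class="grid-price-money">£10.00</span></li>
    <li><img src="//cdn.example/b.jpg">
        <p class="product-title">ITEM B</p>
        <span class="grid-price-money">£20.00</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

grids = soup.find_all('ul', class_='grid--full product-grid-items')
print(len(grids))  # only one <ul> matches, so grids[1] would raise IndexError

# Iterate over the <li> items inside the single grid instead of indexing grids:
for item in grids[0].find_all('li'):
    print(item.find('img')['src'])
    print(item.find(class_='product-title').get_text(strip=True))
    print(item.find(class_='grid-price-money').get_text(strip=True))
```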
Solution
If you just want to use BeautifulSoup without Selenium, you can simulate the Ajax requests the page makes. For example:
import requests
from bs4 import BeautifulSoup

url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'

page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    li = soup.find_all('li', recursive=False)
    if not li:
        break

    for l in li:
        print(l.select_one('p a').get_text(strip=True))
        print('https:' + l.img['src'])
        print(l.select_one('.grid-price').get_text(strip=True, separator=' '))
        print('-' * 80)

    page += 1
This prints:
LIGHTWEIGHT RAIN SHELL
https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-dark-red-1_large.jpg?v=1598583974
£178.00
--------------------------------------------------------------------------------
LIGHTWEIGHT RAIN SHELL
https://cdn.shopify.com/s/files/1/1605/0171/products/lightweight-rain-shell-black-1_large.jpg?v=1598583976
£178.00
--------------------------------------------------------------------------------
ALL COUNTRY HOODIE
https://cdn.shopify.com/s/files/1/1605/0171/products/all-country-hoodie-white-1_large.jpg?v=1598583978
£148.00
--------------------------------------------------------------------------------
...and so on.
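The `recursive=False` in the loop above matters because the `pagination-ajax` view returns bare `<li>` fragments at the top level of the response; restricting `find_all` to direct children of the document avoids also picking up any `<li>` nested inside a product card. A minimal offline sketch (the fragment here is made up, not the shop's actual markup):

```python
from bs4 import BeautifulSoup

# Hypothetical Ajax response: bare <li> fragments at the top level,
# one of which contains a nested list (e.g. colour swatches).
fragment = """
<li>PRODUCT ONE <ul><li>red</li><li>black</li></ul></li>
<li>PRODUCT TWO</li>
"""

soup = BeautifulSoup(fragment, 'html.parser')

top_level = soup.find_all('li', recursive=False)  # direct children only
all_li = soup.find_all('li')                      # includes the nested <li>

print(len(top_level))  # 2 products
print(len(all_li))     # 4, since the nested swatch <li> are counted too
```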
EDIT (saving to CSV):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://uk.octobersveryown.com/collections/all?page={page}&view=pagination-ajax'

page = 1
all_data = []
while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    li = soup.find_all('li', recursive=False)
    if not li:
        break

    for l in li:
        d = {'name': l.select_one('p a').get_text(strip=True),
             'link': 'https:' + l.img['src'],
             'price': l.select_one('.grid-price').get_text(strip=True, separator=' ')}
        all_data.append(d)
        print(d)
        print('-' * 80)

    page += 1

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
This prints:
name ... price
0 LIGHTWEIGHT RAIN SHELL ... £178.00
1 LIGHTWEIGHT RAIN SHELL ... £178.00
2 ALL COUNTRY HOODIE ... £148.00
3 ALL COUNTRY HOODIE ... £148.00
4 ALL COUNTRY HOODIE ... £148.00
.. ... ... ...
271 OVO ESSENTIALS LONGSLEEVE T-SHIRT ... £58.00
272 OVO ESSENTIALS POLO ... £68.00
273 OVO ESSENTIALS T-SHIRT ... £48.00
274 OVO ESSENTIALS CAP ... £38.00
275 POM POM COTTON TWILL CAP ... £32.00 SOLD OUT
[276 rows x 3 columns]
and saves data.csv (the original answer included a LibreOffice screenshot of the file here).
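One usage note on the CSV step: by default `to_csv` also writes the DataFrame's row index as an extra unnamed first column, which shows up in LibreOffice as a column of 0, 1, 2, ... Passing `index=False` keeps the file to just the three scraped columns. A small self-contained sketch with made-up rows:

```python
import pandas as pd

# Made-up rows shaped like the scraped data above.
all_data = [
    {'name': 'LIGHTWEIGHT RAIN SHELL',
     'link': 'https://cdn.example/rain-shell.jpg',
     'price': '£178.00'},
    {'name': 'ALL COUNTRY HOODIE',
     'link': 'https://cdn.example/hoodie.jpg',
     'price': '£148.00'},
]

df = pd.DataFrame(all_data)

# index=False drops the extra unnamed index column from the output file.
df.to_csv('data.csv', index=False)

with open('data.csv') as f:
    header = f.readline().strip()
print(header)  # name,link,price
```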