python - Web Scrape - 任何人都可以协助清理这个吗?
问题描述
在这方面我还很陌生,而且我已经在这个网页上工作了好几天了。我一直在积极尝试避免问这个问题,但我被严重卡住了。
我的问题
- 我有 span 循环当前定位的位置,它每次运行“for product”循环时都会打印每个列表的所有价格。如果我把它放在这个循环之外,它要么打印列表中的第一个,要么打印列表中的最后一个。如何提取价格并将其打印在每个单独的产品旁边。
我知道我列出了很多未使用的进口商品。这些只是我正在尝试但尚未删除它们的各种途径。
这里的最终目标是推送到 json 或 csv 文件(目前也没有编写 - 但有一个公平的想法,一旦有数据如何处理这个方面。
from bs4 import BeautifulSoup
import requests
import shutil
import csv
import pandas
from pandas import DataFrame
import re
import os
import urllib
import locale
import json
from selenium import webdriver
os.environ["PYTHONIOENCODING"] = "utf-8"
browser = webdriver.Chrome(executable_path='C:/Users/andrew.glass/chromedriver.exe')
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", "GC62 Product")
for product in products:
#title
title = product.find("h3")
titleText = title.text if title else ''
#manufacturer name
manufacturer = product.find("div", "GC5 ProductManufacturer")
manuText = manufacturer.text if manufacturer else ''
#image location
img = product.find("div", "ProductImage")
imglinks = img.find("a") if img else ''
imglinkhref = imglinks.get('href') if imglinks else ''
imgurl = 'https://www.mcavoyguns.co.uk/contents'+imglinkhref
#print(imgurl.replace('..', ''))
#description
description = product.find("div", "GC12 ProductDescription")
descText = description.text if description else ''
#more description
more = product.find("div", "GC12 ProductDetailedDescription")
moreText = more.text if more else ''
#price - not right
spans = browser.find_elements_by_css_selector("div.GC20.ProductPrice span")
for i in range(0,len(spans),2):
span = spans[i].text
print(span)
i+=1
print(titleText)
print(manuText)
print(descText)
print(moreText)
print(imgurl.replace('..', ''))
print("\n")
输出:
£1,695.00
£1,885.00
£1,885.00
£2,015.00
£2,175.00
£2,175.00
£2,385.00
£2,115.00
£3,025.00
£3,315.00
£3,635.00
£3,925.00
£2,765.00
£3,045.00
£3,325.00
£3,615.00
£3,455.00
£2,815.00
£3,300.00
£3,000.00
£7,225.00
£7,555.00
£7,635.00
£7,015.00
£7,355.00
12G Beretta DT11 Trap Adjustable
beretta
Click on more for full details.
You may order this gun online, however due to UK Legislation we are not permitted to ship directly to you, we can however ship to a registered firearms dealer local to you. Once we receive your order, we will contact you to arrange which registered firearms dealer you would like the gun to be shipped to.
DT 11 Trap (Steelium Pro)
12
2 3/4"
10x10 rib
3/4&F
30"/32" weight; 4k
https://www.mcavoyguns.co.uk/contents/media/l_dt11_02.jpg
解决方案
@baduker 的评论为我指明了正确的方向——缩进。谢谢!
import requests
import shutil
import csv
import pandas
from pandas import DataFrame
import re
import os
import urllib
import locale
import json
from selenium import webdriver
os.environ["PYTHONIOENCODING"] = "utf-8"
browser = webdriver.Chrome(executable_path='C:/Users/andrew.glass/chromedriver.exe')
browser.get("https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html")
URL = 'https://www.mcavoyguns.co.uk/contents/en-uk/d130_Beretta_Over___Under_Competeition_shotguns.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", "GC62 Product")
for product in products:
#title
title = product.find("h3")
titleText = title.text if title else ''
#manufacturer name
manufacturer = product.find("div", "GC5 ProductManufacturer")
manuText = manufacturer.text if manufacturer else ''
#image location
img = product.find("div", "ProductImage")
imglinks = img.find("a") if img else ''
imglinkhref = imglinks.get('href') if imglinks else ''
imgurl = 'https://www.mcavoyguns.co.uk/contents'+imglinkhref
#print(imgurl.replace('..', ''))
#description
description = product.find("div", "GC12 ProductDescription")
descText = description.text if description else ''
#more description
more = product.find("div", "GC12 ProductDetailedDescription")
moreText = more.text if more else ''
#price - not right
spans = browser.find_elements_by_css_selector("div.GC20.ProductPrice span")
for i in range(0,len(spans),2):
span = spans[i].text
print(span)
i+=1
print(titleText)
print(manuText)
print(descText)
print(moreText)
print(imgurl.replace('..', ''))
print("\n")
推荐阅读
- c# - C# datagridview 列在排序时清除
- mysql - 使用“源 Table.sql”时,mariadb 控制台不接受变异元音(元音变音)
- c# - 在现有文件的详细信息选项卡中添加/编辑新的扩展属性
- python - 在我切换窗口之前,OSX 上的 Tkinter 按钮浮雕不会改变
- javascript - Webpack ES6- 使用动态导入加载 Json(保留 json 文件)
- c# - Firestore REST 身份验证
- batch-file - 为什么重新格式化成功的批处理文件会导致失败?
- java - Android改造v2插入失败错误403
- python - Python中的LBP实现
- c# - 在 Elasticsearch 6.x 中结合 Join 类型和 Nested 类型并进行查询