python - Scraping used car listings with BeautifulSoup
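Problem description
I'm trying to write a program that scrapes used car listings from a website and prints each listing's link, price, mileage, and engine power. Right now it just repeats the first listing; it should print every listing on the page.
The website is in Estonian; I hope that's not a problem.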
import requests
from bs4 import BeautifulSoup
import unicodedata

url = 'https://www.auto24.ee/kasutatud/nimekiri.php?bn=2&a=100&b=7&ae=2&af=50&ssid=21570860'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

for div in soup.find_all('div', {'class' : 'result-row'}):
    def getLink():
        find_link = soup.find('a', {'class' : 'main'})
        link = (find_link.get('href'))
        link_string = ('https://www.auto24.ee' + link)
        return link_string

    def getPrice():
        find_price = soup.find('span', {'class' : 'price'})
        price = (find_price.get_text())
        price_string = unicodedata.normalize("NFKD", price)
        return price_string + ','

    def getMileage():
        find_mileage = soup.find('span', {'class' : 'mileage'})
        mileage = (find_mileage.get_text())
        return mileage + ','

    def getPower():
        engine = requests.get(getLink())
        kW_string = 'kW'
        engine_stats = BeautifulSoup(engine.text, 'lxml')
        if engine_stats.find(kW_string) != -1:
            power_find = engine_stats.find('tr', {'class' : 'field-mootorvoimsus'})
            power = power_find.find('span', {'class' : 'value'})
            power_string = power.get_text()
            return power_string
        else:
            return ('Engine power not specified.')

    print(getLink() + ',', getPrice(), getMileage(), getPower())
Output:
https://www.auto24.ee/soidukid/3554965, 1600 €, 174 000 km, 1.8
https://www.auto24.ee/soidukid/3554965, 1600 €, 174 000 km, 1.8
https://www.auto24.ee/soidukid/3554965, 1600 €, 174 000 km, 1.8
https://www.auto24.ee/soidukid/3554965, 1600 €, 174 000 km, 1.8
...and so on.
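Solution
If you watch the page URL as you move through the results, you'll see it change, so you can use the ak parameter (ak=0, ak=50, and so on) to fetch the data for each page.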
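As a rough illustration of that paging idea, the per-page URLs can be built without hitting the network. This is only a sketch: `page_urls` and `BASE_PARAMS` are hypothetical names, and it assumes `ak` is a result offset that advances by 50 per page, which is what the URLs in this thread suggest.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.auto24.ee/kasutatud/nimekiri.php"

# Query parameters copied from the search URL used in this answer.
BASE_PARAMS = {"bn": 2, "a": 100, "b": 7, "ae": 2, "af": 50, "ssid": 21612624}


def page_urls(pages, page_size=50):
    """Return one search URL per results page (hypothetical helper).

    Assumes `ak` is the offset of the first listing on the page.
    """
    return [
        f"{BASE_URL}?{urlencode({**BASE_PARAMS, 'ak': page * page_size})}"
        for page in range(pages)
    ]


for url in page_urls(3):
    print(url)
```

Each URL can then be passed to `requests.get` in turn. Alternatively, `requests.get(BASE_URL, params={...})` builds the query string itself.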
import requests
from bs4 import BeautifulSoup

# ak takes the values 0, 50 and 100, one per results page
for offset in range(0, 150, 50):
    print(offset)
    res = requests.get(f"https://www.auto24.ee/kasutatud/nimekiri.php?bn=2&a=100&b=7&ae=2&af=50&ssid=21612624&ak={offset}")
    soup = BeautifulSoup(res.text, "html.parser")
    main_data = soup.find("div", attrs={"id": "usedVehiclesSearchResult-flex"}).find_all("div", class_="description")
    # Search inside each listing element rather than the whole page
    for item in main_data:
        print(item.find("a", class_="main")['href'], end=" ")
        print(item.find("span", class_="engine").get_text(), end=" ")
        print(item.find("span", class_="price").get_text(), end=" ")
        try:
            print(item.find("span", class_="mileage").get_text())
        except AttributeError:
            # find() returned None: this listing has no mileage span
            print("NAN")
Output:
0
/soidukid/3554965 1.8 450 € 174 000 km
/soidukid/3563070 1.9 85kW 450 € 514 000 km
/soidukid/3564181 1.6 74kW 500 € 323 032 km
/soidukid/3563999 1.8 85kW 500 € 374 699 km
/soidukid/3550730 2.0 85kW 500 € 420 000 km
..
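As for why the code in the question repeats the first listing: each helper calls `soup.find(...)`, which always returns the first match on the whole page, instead of searching inside the current `div`. Below is a minimal offline sketch of the per-listing search. The HTML is invented sample markup that reuses the class names from this thread, so the real page may differ, and `text_or_default` and `parse_rows` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the class names used in this thread;
# the real page structure may differ.
SAMPLE_HTML = """
<div class="result-row">
  <a class="main" href="/soidukid/1">Car A</a>
  <span class="price">450 €</span>
  <span class="mileage">174 000 km</span>
</div>
<div class="result-row">
  <a class="main" href="/soidukid/2">Car B</a>
  <span class="price">500 €</span>
</div>
"""


def text_or_default(parent, tag, css_class, default="NAN"):
    """Return the stripped text of the first match inside `parent`, or a default."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else default


def parse_rows(html):
    """Return a (link, price, mileage) tuple for every result row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.find_all("div", class_="result-row"):
        # Search inside `div`, not `soup`, so each row yields its own data.
        link = "https://www.auto24.ee" + div.find("a", class_="main")["href"]
        price = text_or_default(div, "span", "price")
        mileage = text_or_default(div, "span", "mileage")
        rows.append((link, price, mileage))
    return rows


for row in parse_rows(SAMPLE_HTML):
    print(", ".join(row))
```

Checking for `None` explicitly does the same job as the `try`/`except AttributeError` in the answer, but makes the missing-field case visible at the call site.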