python - 使用 beautifulsoup4,避免在不存在的元素上出现 AttributeError
问题描述
一天中的好时光!
在进行抓取项目时,我遇到了一些问题。目前我正在起草一份草案:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import requests
from bs4 import BeautifulSoup
import time
import random
#driver Path
PATH = "C:\Program Files (x86)\chromedriver"
BASE_URL = "https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167"
driver = webdriver.Chrome(PATH)
driver.implicitly_wait(30)
driver.get(BASE_URL)
time.sleep(random.uniform(3.0, 5.0))
btn = driver.find_elements_by_xpath('//*[@id="uc-btn-accept-banner"]')[0]
btn.click()
r = requests.get(BASE_URL)
soup = BeautifulSoup(r.content, "html.parser")
def reader(url):
ls = list()
ImmoWebCode = url.find(class_ ="classified__information--immoweb-code").text.strip()
Price = url.find("p", class_="classified__price").find("span",class_="sr-only").text.strip()
Locality = url.find(class_="classified__information--address-row").find("span").text.strip()
HouseType = url.find(class_="classified__title").text.strip()
LivingArea = url.find("th",text="Living area").find_next(class_="classified-table__data").next_element.strip()
RoomsNumber = url.find("th",text="Bedrooms").find_next(class_="classified-table__data").next_element.strip()
Kitchen = url.find("th",text="Kitchen type").find_next(class_="classified-table__data").next_element.strip()
TerraceOrientation = url.find("th",text="Terrace orientation").find_next(class_="classified-table__data").next_element.strip()
TerraceArea = url.find("th",text="Terrace").find_next(class_="classified-table__data").next_element.strip()
Furnished = url.find("th",text="Furnished").find_next(class_="classified-table__data").next_element.strip()
ls.append(Furnished)
OpenFire = url.find("th", text="How many fireplaces?").find_next(class_="classified-table__data").next_element.strip()
GardenOrientation = url.find("th", text="Garden orientation").find_next(class_="classified-table__data").next_element.strip()
ls.append(GardenOrientation)
GardenArea = url.find("th",text="Garden surface").find_next(class_="classified-table__data").next_element.strip()
PlotSurface = url.find("th",text="Surface of the plot").find_next(class_="classified-table__data").next_element.strip()
ls.append(PlotSurface)
FacadeNumber = url.find("th",text="Number of frontages").find_next(class_="classified-table__data").next_element.strip()
SwimmingPoool = url.find("th",text="Swimming pool").find_next(class_="classified-table__data").next_element.strip()
StateOfTheBuilding = url.find("th",text="Building condition").find_next(class_="classified-table__data").next_element.strip()
return ls
print(reader(soup))
我开始面临问题,当代码到达“Locality”时,我收到一个Exception has occurred: AttributeError 'NoneType' object has no attribute 'find
,尽管很明显提到的元素存在于 HTML 代码中。我坚持认为这是一个合成器问题,但我无法解决这个问题。
这让我想到了第二个问题:由于此代码将在多个页面上运行,因此这些页面可能没有请求的元素。None
如果它发生,我该如何放置价值。
非常感谢您!
源代码:
<div class="classified__header-secondary-info classified__informations"><p class="classified__information--property">
3 bedrooms
<span aria-hidden="true">|</span>
199
<span class="abbreviation"><span aria-hidden="true">
m² </span> <span class="sr-only">
square meters </span></span></p> <div class="classified__information--financial"><!----> <a href="https://www.immoweb.be/en/credit-application?classified=9308167&icid_to=mortgage&icid_cta=classified-header" class="classified__information--mortgage-banner small"><!----> <span class="mortgage-banner__text">Request your mortgage loan</span></a></div> <div class="classified__information--address"><p><span class="classified__information--address-row"><span>
1340
</span> <span aria-hidden="true">—</span> <span>
Ottignies
</span>
|
</span> <button class="button button--text button--size-small classified__information--address-button">
Ask for the exact address
</button></p></div> <div class="classified__information--immoweb-code">
Immoweb code : 9308167
</div></div>
解决方案
这些数据不都在<table>
网站的标签中吗?可以只使用熊猫:
import requests
import pandas as pd
url = 'https://www.immoweb.be/en/classified/house/for-sale/ottignies/1340/9308167'
headers= {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
response = requests.get(url,headers=headers)
dfs = pd.read_html(response.text)
df = pd.concat(dfs).dropna().reset_index(drop=True)
df = df.pivot(index=None, columns=0,values=1).bfill().iloc[[0],:]
输出:
print(df.to_string())
0 Address As built plan Available as of Available date Basement Bathrooms Bedrooms CO₂ emission Construction year Double glazing Elevator Energy class External reference Furnished Garden Garden orientation Gas, water & electricity Heating type Investment property Kitchen type Land is facing street Living area Living room surface Number of frontages Outdoor parking spaces Price Primary energy consumption Reference number of the EPC report Shower rooms Surface of the plot Toilets Website Yearly theoretical total energy consumption
0 Grand' Route 69 A 1435 - Corbais No To be defined December 31 2022 - 12:00 AM Yes 1 3 Not specified 2020 Yes Yes Not specified 8566 - 4443 No Yes East Yes Gas No Installed No 199 m² square meters 48 m² square meters 2 1 € 410,000 410000 € Not specified Not specified 1 150 m² square meters 3 http://www.gilmont.be Not specified
推荐阅读
- java - 不兼容的 Spring Boot 版本导致应用程序无法启动。我们如何识别 pom 中包含的错误版本
- swift - 在按钮单击大纲视图上创建新的列单元格
- linux - 有没有办法使用 Wireshark 或 Tcpdump 检查套接字优先级?
- python - 如何使用 gensim KeyedVectors 减去和添加向量?
- tcp - TCP上的注册消息
- javascript - 使用样式参数的样式组件与功能组件
- linux - 从 Linux 命令行仅生成 SHA-256 哈希
- selenium - ChromeDriverService 和 Azure DevOps 代理
- ios - 通过 CocoaPods 添加 AudioKit 后的链接错误
- mysql - 在 SQL 过程中执行,限制为 i