html - Web scraping with BeautifulSoup in an IBM Watson Studio Jupyter Notebook is not working
Problem description
I want to scrape data from this search results page in an IBM Watson Studio Jupyter Notebook:
https://www.aspc.co.uk/search/?PrimaryPropertyType=Rent&SortBy=PublishedDesc&LastUpdated=AddedAnytime&SearchTerm=&PropertyType=Residential&PriceMin=&PriceMax=&Bathrooms=&OrMoreBathrooms=true&Bedrooms=&OrMoreBedrooms=true&HasCentralHeating=false&HasGarage=false&HasDoubleGarage=false&HasGarden=false&IsNewBuild=false&IsDevelopment=false&IsParkingAvailable=false&IsPartExchangeConsidered=false&PublicRooms=&OrMorePublicRooms=true&IsHmoLicense=false&IsAllowPets=false&IsAllowSmoking=false&IsFullyFurnished=false&IsPartFurnished=false&IsUnfurnished=false&ExcludeUnderOffer=false&IncludeClosedProperties=true&ClosedDatesSearch=14&MapSearchType=EDITED&ResultView=LIST&ResultMode=NONE&AreaZoom=13&AreaCenter[lat]=57.14955426557916&AreaCenter[lng]=-2.0927401123046785&EditedZoom=13&EditedCenter[lat]=57.14955426557916&EditedCenter[lng]=-2.0927401123046785
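I have tried BeautifulSoup, and I have also tried Selenium (full disclosure: I am a beginner), in many code variations. I have worked through dozens of Stack Overflow questions, Medium articles, and so on, but I cannot work out what I am doing wrong.
My latest attempt is: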
from bs4 import BeautifulSoup

# `response` is not defined in this snippet; it is presumably the result of
# requests.get() on the search URL above.
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
properties_containers = html_soup.find_all('div', class_='information-card property-card col ')
print(type(properties_containers))
print(len(properties_containers))
This returns 0:
<class 'bs4.element.ResultSet'>
0
Can anyone point me in the right direction? What am I doing wrong, or what am I missing?
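Solution
The data you see on the page is loaded via JavaScript. BeautifulSoup cannot execute JavaScript, so the listing elements are not in the HTML it parses, but you can use the requests module to load the same data directly from the site's API.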
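One way to see the effect for yourself is sketched below, using only the standard library. It counts the div elements that carry the card classes from the question, first in a page shell as a server might send it and then in markup as it might look after a script has filled it in. Both HTML strings are made-up stand-ins, not the actual markup of the site, and the names CardCounter and count_cards are hypothetical.

```python
from html.parser import HTMLParser


class CardCounter(HTMLParser):
    """Count <div> tags whose class attribute contains all wanted classes."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # Split the class attribute on whitespace so that ordering and
        # stray spaces do not affect the comparison.
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_cards(html, wanted=("information-card", "property-card")):
    counter = CardCounter(wanted)
    counter.feed(html)
    return counter.count


# Hypothetical markup: what the server sends before any script runs ...
STATIC_HTML = '<div id="results"></div><script src="app.js"></script>'
# ... and what the browser's DOM might look like after the script has run.
RENDERED_HTML = (
    '<div id="results">'
    '<div class="information-card property-card col ">44 Roslin Place</div>'
    '</div>'
)

print(count_cards(STATIC_HTML))    # 0
print(count_cards(RENDERED_HTML))  # 1
```

A plain HTTP request only ever receives something like the first string, which is why the search finds nothing. Comparing class names as a set also sidesteps the trailing space in the class string used in the question.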
For example, the following script queries the API directly:
import json
import requests
url = 'https://www.aspc.co.uk/search/?PrimaryPropertyType=Rent&SortBy=PublishedDesc&LastUpdated=AddedAnytime&SearchTerm=&PropertyType=Residential&PriceMin=&PriceMax=&Bathrooms=&OrMoreBathrooms=true&Bedrooms=&OrMoreBedrooms=true&HasCentralHeating=false&HasGarage=false&HasDoubleGarage=false&HasGarden=false&IsNewBuild=false&IsDevelopment=false&IsParkingAvailable=false&IsPartExchangeConsidered=false&PublicRooms=&OrMorePublicRooms=true&IsHmoLicense=false&IsAllowPets=false&IsAllowSmoking=false&IsFullyFurnished=false&IsPartFurnished=false&IsUnfurnished=false&ExcludeUnderOffer=false&IncludeClosedProperties=true&ClosedDatesSearch=14&MapSearchType=EDITED&ResultView=LIST&ResultMode=NONE&AreaZoom=13&AreaCenter[lat]=57.14955426557916&AreaCenter[lng]=-2.0927401123046785&EditedZoom=13&EditedCenter[lat]=57.14955426557916&EditedCenter[lng]=-2.0927401123046785'
api_url = 'https://api.aspc.co.uk/Property/GetProperties?{}&Sort=PublishedDesc&Page=1&PageSize=12'
params = url.split('?')[-1]
data = requests.get(api_url.format(params)).json()
# Uncomment the next line to see all the data received from the server:
# print(json.dumps(data, indent=4))

# Print some of the data to the screen:
for property_ in data:
    print(property_['Location']['AddressLine1'])
    print(property_['CategorisationDescription'])
    print('Bedrooms:', property_['Bedrooms'])        # number of bedrooms
    print('Bathrooms:', property_['Bathrooms'])      # number of bathrooms
    print('PublicRooms:', property_['PublicRooms'])  # number of public rooms
    # ... etc.
    print('-' * 80)
Prints:
44 Roslin Place
Fully furnished 2 Bdrm 1st flr Flat. Hall. Lounge. Dining kitch. 2 Bdrms. Bathrm (CT band - C). Deposit 1 months rent. Parking. No pets. No smokers. Rent £550 p.m Entry by arr. Viewing contact solicitors. Landlord reg: 871287/100/26061. (EPC band - B).
Bedrooms: 2
Bathrooms: 1
PublicRooms: 1
--------------------------------------------------------------------------------
Second Floor Left, 173 Victoria Road
Unfurnished 1 Bdrm 2nd flr Flat. Hall. Lounge. Dining kitch. Bdrm. Bathrm (CT Band - A). Deposit 1 months rent. No pets. No smokers. Rent £375 p.m Immed entry. Viewing contact solicitors. Landlord reg: 1261711/100/09072. (EPC band - D).
Bedrooms: 1
Bathrooms: 1
PublicRooms: 1
--------------------------------------------------------------------------------
102 Bedford Road
Fully furnished 3 Bdrm 1st flr Flat. Hall. Lounge. Kitch. 3 Bdrms. Bathrm (CT band - B). Deposit 1 months rent. Garden. HMO License. No pets. No smokers. Rent £750 p.m Entry by arr. Viewing contact solicitors. Landlord reg: 49171/100/27130. (EPC band - D).
Bedrooms: 3
Bathrooms: 1
PublicRooms: 1
--------------------------------------------------------------------------------
... and so on.
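The URL handling in the script above could also be wrapped in a small helper, sketched here with urllib.parse instead of splitting on '?'. The helper name build_api_url and its page and page_size arguments are hypothetical. The Page and PageSize parameters appear in the API URL above, but whether the API accepts values other than 1 and 12 is an assumption that has not been verified here.

```python
from urllib.parse import urlsplit

# Same endpoint and fixed parameters as in the answer's api_url.
API_TEMPLATE = (
    "https://api.aspc.co.uk/Property/GetProperties"
    "?{query}&Sort=PublishedDesc&Page={page}&PageSize={page_size}"
)


def build_api_url(search_url, page=1, page_size=12):
    """Reuse the search page's query string in an API URL for one page."""
    query = urlsplit(search_url).query
    return API_TEMPLATE.format(query=query, page=page, page_size=page_size)


# Shortened example URL; the full search URL from the question works the same way.
print(build_api_url(
    "https://www.aspc.co.uk/search/?PrimaryPropertyType=Rent&Bedrooms=2",
    page=2,
))
```

The resulting string can be passed to requests.get(...).json() exactly as in the script above. If the assumption about Page holds, looping over increasing page numbers until an empty list comes back would collect all results.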