python-3.x - Unable to extract a web page's HTML source (BeautifulSoup)
Problem description
Page source: view-source:https://www.myhome.ie/residential/dublin/property-for-sale
import requests
from bs4 import BeautifulSoup
url = "https://www.myhome.ie/residential/dublin/property-for-sale"
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')
print(soup)
# paging = soup.find_all("div",{"class":"PropertyInfoStrip ng-star-inserted"})
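I need the page's HTML source so that I can scrape the div classes, but with bs4 I only get the JS scripts and cannot extract any HTML. What am I doing wrong? When I inspect the elements in the browser, I can see the HTML.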
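A quick way to confirm this symptom is to check whether the target class appears anywhere in the raw response text. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it has no dependencies, and the helper names (ClassCounter, count_class) and the sample HTML strings are hypothetical, not taken from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags carrying a given CSS class, plus <script> tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.count = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.count += 1


def count_class(html_text, wanted):
    """Return (elements_with_class, script_tags) for a piece of HTML."""
    parser = ClassCounter(wanted)
    parser.feed(html_text)
    parser.close()
    return parser.count, parser.scripts


# Hypothetical samples: what a server might send vs. what a browser shows
# after the scripts have run.
served = '<html><body><app-root></app-root><script src="main.js"></script></body></html>'
rendered = '<div class="PropertyInfoStrip ng-star-inserted">...</div>'

print(count_class(served, "PropertyInfoStrip"))
print(count_class(rendered, "PropertyInfoStrip"))
```

If the count is zero for the text returned by requests.get(url).text while the browser's inspector shows the element, the markup is most likely being built client-side after the page loads.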
Solution
The data is loaded dynamically via JavaScript, so BeautifulSoup cannot see it. You can issue the Ajax request yourself to get the data in JSON format:
import json
import requests
params = {
    "ApiKey": "5f4bc74f-8d9a-41cb-ab85-a1b7cfc86622",
    "CorrelationId": "e4e14c46-53e6-463f-9bdc-f67785bd4915",
    "SessionId": None,
    "RequestTypeId": 2,
    "RequestVerb": "POST",
    "Endpoint": "https://api.myhome.ie/search",
    "Page": 1,
    "PageSize": 20,
    "SortColumn": 2,
    "SortDirection": 2,
    "SearchRequest": {
        "IsBackendSearch": False,
        "SkipSearchIndex": False,
        "IsGroupPrivateSearch": False,
        "IsSaleAgreed": False,
        "IsSold": False,
        "IsAuction": False,
        "IsBoundsSearch": False,
        "UseFreeTextSearchForKeywords": False,
        "SearchContent": False,
        "PropertyIds": [],
        "GroupIds": [],
        "ChannelIds": [1],
        "PropertyTypeIds": [],
        "PropertyClassIds": [1],
        "PropertyStatusIds": [2, 12],
        "SaleTypeIds": [],
        "FeatureTypeIds": [],
        "RegionId": 1265,
        "LocalityIds": [],
        "LocalityNames": [],
        "NegotiatorIds": [],
        "SolicitorIds": [],
        "BuyerSolicitorIds": [],
        "VendorSolicitorIds": [],
        "TransferedByUserIds": [],
        "RowStatusIds": [2],
        "EnergyRatings": [],
        "Polygons": [],
        "Destinations": [],
        "Tags": [],
        "PrivateTags": [],
        "PreSixtyThree": False,
        "IsActive": True,
        "HasPhotos": False,
        "PriceFrequency": "Monthly",
    },
}
url = "https://api.myhome.ie/search"
params["Page"] = 1  # <--- change to desired page

# send the search parameters as a JSON body and decode the JSON response
data = requests.post(url, json=params).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

# print some results to screen
for result in data["SearchResults"]:
    print("{:<30} {}".format(result["PropertyType"], result["DisplayAddress"]))
Prints:
Semi-Detached House            248 Swords Road, Whitehall, Dublin 9, D09 K8W7
Apartment                      24 Mountfield Park, Malahide, County Dublin
Semi-Detached House            26 Griffeen Glen Boulevard, Lucan, Co. Dublin
Semi-Detached House            4 Bedroom Home at Skylark, St. Marnock's Bay, Portmarnock, Dublin
Terraced House                 250 Laraghcon, Lucan, Co. Dublin
Bungalow                       1 Castleland Park View, Balbriggan, County Dublin
Semi-Detached House            657 Whitechurch Road, Taylors Lane, Rathfarnham, Dublin 14
Terraced House                 22 Reuben Avenue, Rialto, Dublin 8
Semi-Detached House            Merrion Lodge, 135 Mount Merrion Avenue, Blackrock, Co. Dublin
Terraced House                 74 Seapark Drive, Clontarf, Dublin 3
Terraced House                 5 O'Daly Road, Drumcondra, Dublin 9
Detached House                 Churchtown House, Weston Park, Dublin 14, Dublin
Detached House                 St. Kevins, 17 Rathfarnham Park, D14, Dublin 14, Dublin
Terraced House                 The Terrace, Foxrock, Dublin 18
Terraced House                 7 Whately Place, Kilmacud Road Upper, Stillorgan, Co. Dublin
Detached House                 The Cottage, Dublin Road, Oldtown, County Dublin
Semi-Detached House            31 Gleann Na Smol, Oldbawn, Dublin 24
Terraced House                 218 Castlecurragh Heath, Mulhuddart, Dublin 15
Semi-Detached House            19 Woodside, Dodder Park Road, Rathfarnham, Dublin 14
Apartment                      Apartment, 46 Slade Castle Court, Saggart, Co. Dublin
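The "Page" field in the payload suggests that more listings can be collected by requesting successive pages. The following is a minimal sketch of that idea, not a tested client for this API: collect_results and fetch_page_live are hypothetical helper names, and treating an empty "SearchResults" list as the end of the data is an assumption about how the endpoint behaves.

```python
import copy


def collect_results(fetch_page, base_params, max_pages=3):
    """Call fetch_page(params) for pages 1..max_pages and gather the rows."""
    rows = []
    for page in range(1, max_pages + 1):
        params = copy.deepcopy(base_params)  # leave the caller's dict untouched
        params["Page"] = page
        data = fetch_page(params)
        batch = data.get("SearchResults") or []
        if not batch:
            # Assumption: an empty page means there are no more results.
            break
        rows.extend(batch)
    return rows


def fetch_page_live(params, url="https://api.myhome.ie/search"):
    """POST the payload as JSON and return the decoded response."""
    import requests  # imported here so collect_results works without it

    response = requests.post(url, json=params, timeout=30)
    response.raise_for_status()
    return response.json()
```

With the params dictionary from the answer, collect_results(fetch_page_live, params, max_pages=3) would request up to three pages. Passing the fetch function in as an argument keeps the paging loop separate from the network call, so the loop can be exercised with a stub before hitting the real endpoint.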