python - Scraping a dynamic website with Python Selenium
Question
I am trying to scrape a dynamic website with BS4 in Python:
https://www.nadlan.gov.il/?search=%D7%AA%D7%9C%20%D7%90%D7%91%D7%99%D7%91%20%20%D7%99%D7%A4%D7%95
I tried:
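from urllib.request import urlopen
from bs4 import BeautifulSoup

# Percent-encoded search URL (urlopen does not accept raw non-ASCII characters)
url = "https://www.nadlan.gov.il/?search=%D7%AA%D7%9C%20%D7%90%D7%91%D7%99%D7%91%20%20%D7%99%D7%A4%D7%95"
page = urlopen(url)  # download the raw HTML
soup = BeautifulSoup(page, "html.parser")  # parse the response, not the URL string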
I have two problems:
The site is dynamic: when I view the page source I cannot see the page content, only JavaScript:
<script>
    document.write("<script src='scripts/dis/bundleJS.js?v=" + globalAppVersion + "'><\/script>")
    document.write("<script id='srcGovmap' src='https://new.govmap.gov.il/govmap/api/govmap.api.js?v='" + globalAppVersion + "'><\/script>")
    document.write("<script src='MainLoader.js?v=" + globalAppVersion + "'><\/script>")
    document.write("<script id='tld-search-srcipt' src='https://www.nadlan.gov.il/TldSearch/Scripts/ac.js?v=" + globalAppVersion + "'><\/script>");
</script>
<script src="scripts/dis/accessibility/b1.js?v=3" type="text/javascript"></script>
<script type="text/javascript">
    accessibility_rtl = true;
    pixel_from_side = 20;
    pixel_from_start = 15;
    $(document).ready(function () {
        $('#accessibility_icon').attr('src', 'images/accessibility_icon.png')
        $('.accessibility_div_wrap>.btn_accessibility > span.accessibility_component').html('')
    });
When I open the site, the data takes a few seconds to load.
How can Selenium solve these problems?
Answer
The data is loaded dynamically via JavaScript. You can use the requests / json modules to simulate the Ajax call. For example:
import json
import requests
url = 'https://www.nadlan.gov.il/Nadlan.REST/Main/GetAssestAndDeals'
data = {"MoreAssestsType":0,"FillterRoomNum":0,"GridDisplayType":0,"ResultLable":"תל אביב -יפו","ResultType":1,"ObjectID":"5000","ObjectIDType":"text","ObjectKey":"UNIQ_ID","DescLayerID":"SETL_MID_POINT","Alert":None,"X":180428.31832654,"Y":665726.5550939,"Gush":"","Parcel":"","showLotParcel":False,"showLotAddress":False,"OriginalSearchString":"תל אביב יפו","MutipuleResults":False,"ResultsOptions":None,"CurrentLavel":2,"Navs":[{"text":"מחוז תל אביב - יפו","url":None,"order":1}],"QueryMapParams":{"QueryToRun":None,"QueryObjectID":"5000","QueryObjectType":"number","QueryObjectKey":"SETL_CODE","QueryDescLayerID":"KSHTANN_SETL_AREA","SpacialWhereClause":None},"isHistorical":False,"PageNo":1,"OrderByFilled":"DEALDATETIME","OrderByDescending":True,"Distance":0}
result = requests.post(url, json=data).json()
# uncomment this to print all data:
# print(json.dumps(result, indent=4))
# print all results to screen:
for r in result['AllResults']:
    for k, v in r.items():
        print('{:<30} {}'.format(k, v))
    print('-' * 80)
Prints:
DEALDATE 12.12.2020
DEALDATETIME 2020-12-12T00:00:00
FULLADRESS
DISPLAYADRESS
GUSH 7104-289-264
DEALNATUREDESCRIPTION דירה
ASSETROOMNUM 3
FLOORNO None
DEALNATURE 90
DEALAMOUNT 3,650,000
NEWPROJECTTEXT 1
PROJECTNAME מגדלי גינדי תל אביב
BUILDINGYEAR None
YEARBUILT
BUILDINGFLOORS None
KEYVALUE 10812534855
TYPE 2
POLYGON_ID 7104-289
TREND_IS_NEGATIVE False
TREND_FORMAT
--------------------------------------------------------------------------------
DEALDATE 31.07.2020
DEALDATETIME 2020-07-31T00:00:00
FULLADRESS עגנון ש"י 28, תל אביב -יפו
DISPLAYADRESS עגנון ש"י 28
GUSH 6634-336-33
DEALNATUREDESCRIPTION דירה
ASSETROOMNUM 5
FLOORNO None
DEALNATURE 130
DEALAMOUNT 6,363,000
NEWPROJECTTEXT 1
PROJECTNAME הפילהרמונית
BUILDINGYEAR 2020
YEARBUILT
BUILDINGFLOORS 9
KEYVALUE 10812534851
TYPE 1
POLYGON_ID 6634-336
TREND_IS_NEGATIVE False
TREND_FORMAT
--------------------------------------------------------------------------------
...and so on.
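If you need more than the first page, or want numbers and dates instead of strings, something along these lines may help. It is only a sketch under a few assumptions that the answer does not confirm: that changing "PageNo" in the payload returns the next page, that an empty "AllResults" list means there is no more data, and that DEALAMOUNT and DEALDATETIME always arrive as strings in the format shown in the output above. The names normalize_deal, iter_deals and fetch_page are made up for illustration.

```python
from datetime import datetime


def normalize_deal(record):
    """Convert a few string fields of one result into Python types."""
    amount = record.get("DEALAMOUNT") or ""      # e.g. "3,650,000"
    date = record.get("DEALDATETIME")            # e.g. "2020-12-12T00:00:00"
    return {
        "address": record.get("FULLADRESS") or None,  # empty string -> None
        "amount": int(amount.replace(",", "")) if amount else None,
        "date": datetime.fromisoformat(date) if date else None,
        "rooms": record.get("ASSETROOMNUM"),
    }


def iter_deals(fetch_page, payload, max_pages=3):
    """Yield normalized deals page by page.

    fetch_page is any callable that takes a payload dict and returns the
    decoded JSON response as a dict.
    """
    for page_no in range(1, max_pages + 1):
        # Copy the payload so the caller's dict is not modified.
        page_payload = dict(payload, PageNo=page_no)
        results = fetch_page(page_payload).get("AllResults") or []
        if not results:  # assumption: an empty page means no more data
            break
        for record in results:
            yield normalize_deal(record)
```

With the url and data from the answer, this could be called as iter_deals(lambda p: requests.post(url, json=p).json(), data). Passing the fetch function in as an argument keeps the paging logic testable without touching the network.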