Scraping a dynamic website with Python and Selenium

Question

I am trying to scrape a dynamic website with BS4 in Python:

https://www.nadlan.gov.il/?search=%D7%AA%D7%9C%20%D7%90%D7%91%D7%99%D7%91%20%20%D7%99%D7%A4%D7%95

I tried:

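from urllib.parse import quote
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Percent-encode the Hebrew search term, fetch the page, then parse the HTML.
url = "https://www.nadlan.gov.il/?search=" + quote("תל אביב יפו")
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")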
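
The address in the question is the percent-encoded UTF-8 form of the Hebrew search term. As a minimal sketch using only the standard library, the snippet below decodes the `search` parameter and rebuilds the same URL from the decoded text. The helper names `search_term` and `build_search_url` are made up for this illustration and are not part of the site or of any library.

```python
from urllib.parse import parse_qs, quote, urlsplit

# The URL from the question, with the search term percent-encoded.
ENCODED_URL = (
    "https://www.nadlan.gov.il/?search="
    "%D7%AA%D7%9C%20%D7%90%D7%91%D7%99%D7%91%20%20%D7%99%D7%A4%D7%95"
)


def search_term(url):
    """Return the decoded value of the `search` query parameter."""
    query = urlsplit(url).query
    return parse_qs(query)["search"][0]


def build_search_url(term):
    """Percent-encode a search term and append it to the base URL."""
    return "https://www.nadlan.gov.il/?search=" + quote(term)


term = search_term(ENCODED_URL)
print(repr(term))                              # the Hebrew text, spaces included
print(build_search_url(term) == ENCODED_URL)   # the round trip gives the same URL
```

Encoding the term explicitly avoids passing raw non-ASCII characters to `urlopen`, which expects an ASCII-safe URL.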

I have two problems:

  1. The site is dynamic: when I view the page source I don't see the page content, only JavaScript:

    <script>
        document.write("<script src='scripts/dis/bundleJS.js?v=" + globalAppVersion + "'><\/script>")
        document.write("<script id='srcGovmap' src='https://new.govmap.gov.il/govmap/api/govmap.api.js?v='" + globalAppVersion + "'><\/script>")
        document.write("<script src='MainLoader.js?v=" + globalAppVersion + "'><\/script>")
        document.write("<script id='tld-search-srcipt' src='https://www.nadlan.gov.il/TldSearch/Scripts/ac.js?v=" + globalAppVersion + "'><\/script>");
    </script>
    
    <script src="scripts/dis/accessibility/b1.js?v=3" type="text/javascript"></script>
    
    <script type="text/javascript">
    
    accessibility_rtl = true;
    pixel_from_side = 20;
    pixel_from_start = 15;
    
    $(document).ready(function () {
        $('#accessibility_icon').attr('src', 'images/accessibility_icon.png')
        $('.accessibility_div_wrap>.btn_accessibility > span.accessibility_component').html('')
    });
    
  2. When I open the site, the data takes a few seconds to load.
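
Both problems have the same cause: the server sends a shell of script tags, and the data only appears after the browser runs them. The following sketch shows what a static parser gets to see. `STATIC_HTML` is a trimmed, hypothetical stand-in for the real page source (the `div` with id `app` is invented), and the standard-library `html.parser` is used here in place of BS4 so the example has no dependencies.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed stand-in for the HTML the server returns:
# an empty container plus script tags, and no listing data at all.
STATIC_HTML = """
<html><body>
<div id="app"></div>
<script src="MainLoader.js?v=1"></script>
<script>accessibility_rtl = true;</script>
</body></html>
"""


class VisibleTextCounter(HTMLParser):
    """Count <script> tags and collect text that sits outside of them."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.in_script = False
        self.visible_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep only non-blank text that is not JavaScript source.
        if not self.in_script and data.strip():
            self.visible_text.append(data.strip())


def summarize(html):
    """Return (number of script tags, list of visible text fragments)."""
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return parser.script_count, parser.visible_text


script_count, visible_text = summarize(STATIC_HTML)
print(script_count, visible_text)
```

A parser never executes JavaScript, so no amount of waiting on the parsing side makes the data appear in this HTML.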

How can Selenium solve these problems?

Tags: python, selenium, web-scraping, beautifulsoup

Answer


The data is loaded dynamically via JavaScript. You can use the requests / json modules to simulate the Ajax call. For example:

import json
import requests


url = 'https://www.nadlan.gov.il/Nadlan.REST/Main/GetAssestAndDeals'
data = {"MoreAssestsType":0,"FillterRoomNum":0,"GridDisplayType":0,"ResultLable":"תל אביב -יפו","ResultType":1,"ObjectID":"5000","ObjectIDType":"text","ObjectKey":"UNIQ_ID","DescLayerID":"SETL_MID_POINT","Alert":None,"X":180428.31832654,"Y":665726.5550939,"Gush":"","Parcel":"","showLotParcel":False,"showLotAddress":False,"OriginalSearchString":"תל אביב  יפו","MutipuleResults":False,"ResultsOptions":None,"CurrentLavel":2,"Navs":[{"text":"מחוז תל אביב - יפו","url":None,"order":1}],"QueryMapParams":{"QueryToRun":None,"QueryObjectID":"5000","QueryObjectType":"number","QueryObjectKey":"SETL_CODE","QueryDescLayerID":"KSHTANN_SETL_AREA","SpacialWhereClause":None},"isHistorical":False,"PageNo":1,"OrderByFilled":"DEALDATETIME","OrderByDescending":True,"Distance":0}
result = requests.post(url, json=data).json()

# uncomment this to print all data:
# print(json.dumps(result, indent=4))

# print all results to screen:
for r in result['AllResults']:
    for k, v in r.items():
        print('{:<30} {}'.format(k, v))
    print('-' * 80)

Prints:

DEALDATE                       12.12.2020
DEALDATETIME                   2020-12-12T00:00:00
FULLADRESS                     
DISPLAYADRESS                  
GUSH                           7104-289-264
DEALNATUREDESCRIPTION          דירה
ASSETROOMNUM                   3
FLOORNO                        None
DEALNATURE                     90
DEALAMOUNT                     3,650,000
NEWPROJECTTEXT                 1
PROJECTNAME                    מגדלי גינדי תל אביב
BUILDINGYEAR                   None
YEARBUILT                      
BUILDINGFLOORS                 None
KEYVALUE                       10812534855
TYPE                           2
POLYGON_ID                     7104-289
TREND_IS_NEGATIVE              False
TREND_FORMAT                   
--------------------------------------------------------------------------------
DEALDATE                       31.07.2020
DEALDATETIME                   2020-07-31T00:00:00
FULLADRESS                     עגנון ש"י 28, תל אביב -יפו
DISPLAYADRESS                  עגנון ש"י 28
GUSH                           6634-336-33
DEALNATUREDESCRIPTION          דירה
ASSETROOMNUM                   5
FLOORNO                        None
DEALNATURE                     130
DEALAMOUNT                     6,363,000
NEWPROJECTTEXT                 1
PROJECTNAME                    הפילהרמונית
BUILDINGYEAR                   2020
YEARBUILT                      
BUILDINGFLOORS                 9
KEYVALUE                       10812534851
TYPE                           1
POLYGON_ID                     6634-336
TREND_IS_NEGATIVE              False
TREND_FORMAT                   
--------------------------------------------------------------------------------


...and so on.
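
The printed values suggest that amounts and dates come back as formatted text. A minimal sketch for turning them into native Python types follows. It assumes `DEALAMOUNT` is a string with comma thousands separators and `DEALDATETIME` is an ISO 8601 string, as the output above indicates. `normalize_deal` is a hypothetical helper, and `sample` holds only two fields copied from the first printed record.

```python
from datetime import datetime

# Two fields copied from the first record printed above; the others are omitted.
sample = {
    "DEALDATETIME": "2020-12-12T00:00:00",
    "DEALAMOUNT": "3,650,000",
}


def normalize_deal(record):
    """Return a copy of the record with the date and amount parsed.

    Assumes both fields are strings in the formats shown in the output.
    """
    out = dict(record)
    out["DEALDATETIME"] = datetime.fromisoformat(record["DEALDATETIME"])
    out["DEALAMOUNT"] = int(record["DEALAMOUNT"].replace(",", ""))
    return out


deal = normalize_deal(sample)
print(deal["DEALDATETIME"].year, deal["DEALAMOUNT"])
```

With the live response, the same helper could be applied to each item of `result['AllResults']`, provided the fields really have these types.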
