Web scraping - content not shown in page source

Question

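I am trying to scrape information from the website https://foreclosures.cabarruscounty.us/. All of the data appears to be rendered in repeating cards, but I can't find it when I view the page source. I have tried a web driver (Selenium), but I still can't see the content I want to scrape. I'd like to extract all of the repeated data for each entry.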

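import bs4 as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# chrome_options was not defined in the original snippet; default options are assumed here
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)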

url = 'https://foreclosures.cabarruscounty.us/'

driver.get(url)

web_url = driver.page_source
soup = bs.BeautifulSoup(web_url, 'html.parser')
print(soup)

How can I access or view the contents of the repeating cards themselves?

Tags: python, selenium, web-scraping

Answer


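The data you see is loaded from an external URL (a JSON file), which is why it does not appear in the page's HTML source. You can fetch it with just the requests module: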

import json
import requests


url = 'https://foreclosures.cabarruscounty.us/dataForeclosures.json'
data = requests.get(url).json()

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

# print some data to screen:
for d in data:
    for k, v in d.items():
        print('{:<5}: {}'.format(k, v))
    print('-' * 80)

Prints:

ID   : 2062
TM   : 04-086 -0010.00
S    : COMPLAINT/JUDGMENT
C    : 20-CVD-1754
R    : 56235032510000
T    : 14,850
O    : W O L INC A NC CORPORATION
M    : 3,703
SD   : PENDING
ST   : PENDING
D    : S/S DALE EARNHARDT BLVD
A    : ZACCHAEUS LEGAL SVCS
CO   : www.zls-nc.com
SL   : 77 UNION ST S CONCORD NC 28025
SP   : COURTHOUSE STEPS
U    : https://foreclosures.cabarruscounty.us/PropertyPhotos/2062.jpg
OR   : 3
--------------------------------------------------------------------------------
ID   : 2061
TM   : 04-007 -0006.00
S    : COMPLAINT/JUDGMENT
C    : 20-CVD-1070
R    : 56036654730000
T    : 135,190
O    : PITTS H M PITTS H M ESTATE
M    : 9,475
SD   : PENDING
ST   : PENDING
D    : SOUTH SIDE MOORESVILLE RD
A    : ZACCHAEUS LEGAL SVCS
CO   : www.zls-nc.com
SL   : 77 UNION ST S CONCORD NC 28025
SP   : COURTHOUSE STEPS
U    : https://foreclosures.cabarruscounty.us/PropertyPhotos/2061.jpg
OR   : 3
--------------------------------------------------------------------------------

...and so on.
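
Since the goal is to extract every field of each entry, it may help to save the records in tabular form. The following is a minimal sketch, not part of the original answer: the helper name `records_to_csv` is hypothetical, and it assumes the JSON is a list of flat dictionaries like the ones printed above. The sample records are abbreviated copies of that output, and their value types are an assumption.

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of flat dicts (one per card) to CSV text."""
    if not records:
        return ""
    # Collect column names in first-seen order, in case a record lacks a key.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Abbreviated sample records taken from the output above (offline stand-in
# for the live JSON; the value types are assumed).
sample = [
    {"ID": 2062, "S": "COMPLAINT/JUDGMENT", "C": "20-CVD-1754", "T": "14,850"},
    {"ID": 2061, "S": "COMPLAINT/JUDGMENT", "C": "20-CVD-1070", "T": "135,190"},
]
print(records_to_csv(sample))
```

With live data, the list returned by `requests.get(url).json()` could be passed in place of `sample`, and the returned text written to a file. Values that contain commas, such as `14,850`, are quoted by the `csv` module so that the columns stay aligned.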
