python - BeautifulSoup does not retrieve the web data
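Problem description
I am building a web scraper to extract company names from a chamber of commerce website directory.
I am using BeautifulSoup. The page and soup objects appear to work, but when I scrape the HTML content, I get an empty list where the directory names from the page should be.
Page I am trying to scrape: https://www.austinchamber.com/directory
Here is the HTML: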
<div>
  <ul class="item-list item-list--small">
    <li>
      <div class='item-content'>
        <div class='item-description'>
          <h5 class='h5'>Women Helping Women LLC</h5>
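Here is the Python code:
import requests
from bs4 import BeautifulSoup

def pageRequest(url):
    # Download the page as the server sends it (no JavaScript is executed)
    page = requests.get(url)
    return page

def htmlSoup(page):
    # Parse the raw response body into a BeautifulSoup tree
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

def getNames(soup):
    # Collect every <h5 class="h5"> element
    name = soup.find_all('h5', class_='h5')
    return name

page = pageRequest("https://www.austinchamber.com/directory")
soup = htmlSoup(page)
name = getNames(soup)
for n in name:
    print(n)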
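Solution
The data is loaded dynamically via Ajax, so the HTML that requests.get() returns does not contain the directory entries and find_all() has nothing to match. To get the data, you can request it from the site's JSON endpoint with the following script: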
import json
import requests

url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'

for page in range(1, 10):
    print('Page {}..'.format(page))
    data = requests.get(url.format(page=page)).json()
    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))
    for d in data['data']:
        print(d['title'])
Prints:
...
Indeed
Austin Telco Federal Credit Union - Taos
Green Bank
Seton Medical Center Austin
Austin Telco Federal Credit Union - Jollyville
Page 42..
Texas State SBDC - San Marcos Office
PlainsCapital Bank - Motor Bank
University of Texas - Thompson Conference Center
Lamb's Tire & Automotive Centers - #2 Research & Braker
AT&T Labs
Prosperity Bank - Rollingwood
Kerbey Lane Cafe - Central
Lamb's Tire & Automotive Centers - #9 Bee Caves
Seton Medical Center Hays
PlainsCapital Bank - North Austin
Ellis & Salazar Body Shop
Lamb's Tire & Automotive Centers - #6 Lake Creek
Rudy's Country Store and BarBQ
...
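The script above uses a fixed page range. As a rough sketch of one alternative, the loop below keeps requesting pages until one comes back with no entries. It assumes the endpoint returns an empty `data` list past the last page, which the original answer does not confirm. The names `iter_titles`, `fake_fetch` and `FAKE_PAGES` are hypothetical, and the fake data only stands in for the real API so the example runs offline.

```python
from typing import Callable, Iterator


def iter_titles(fetch_page: Callable[[int], dict], max_pages: int = 1000) -> Iterator[str]:
    """Yield each entry's title, page by page, until a page has no entries."""
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        items = payload.get("data") or []
        if not items:
            # Assumption: an empty "data" list means there are no more pages.
            break
        for item in items:
            yield item["title"]


# Hypothetical stand-in for the real endpoint (same "data"/"title" shape).
FAKE_PAGES = {
    1: {"data": [{"title": "Green Bank"}, {"title": "AT&T Labs"}]},
    2: {"data": [{"title": "Kerbey Lane Cafe - Central"}]},
}


def fake_fetch(page: int) -> dict:
    # Any page that is not defined behaves like a page past the end.
    return FAKE_PAGES.get(page, {"data": []})


titles = list(iter_titles(fake_fetch))
print(titles)
```

To try this against the live site, `fake_fetch` could be replaced with a function that calls `requests.get(url.format(page=page)).json()`, using the `url` template from the script above. The `max_pages` cap is there so the loop still terminates if the empty-page assumption turns out to be wrong.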