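beautifulsoup - Data parsed from a website ends up empty (bs4, python, lxml)
Problem description
Hello Stack Overflow folks,
I'm having trouble parsing information from a website with BeautifulSoup and lxml.
I'm trying to get the address data from the site "https://www1.nyc.gov/events/events-filter.html#page-1".
From what I found on Google,
I need to 1. find the specific class of the information through the page's "Inspect" tool, and 2. write code similar to g_data = soup.find_all("div", {"class": "event-data-detail"})
So I wrote my code as follows.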
import requests
from bs4 import BeautifulSoup
url = "https://www1.nyc.gov/events/events-filter.html#page-1"
r=requests.get("https://www1.nyc.gov/events/events-filter.html#page-1")
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
g_data = soup.find_all("div", {"class": "event-data-detail"})
print(g_data)
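It shows this warning message:
Warning (from warnings module): File "C:/Users/jotna/Desktop/Portfolio/1.py", line 7 soup = BeautifulSoup(r.content) UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 7 of the file C:/Users/jotna/Desktop/Portfolio/1.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
So I changed the code as follows. (A Stack Overflow post I found suggested adding the lxml code at the end.)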
import lxml
import requests
from bs4 import BeautifulSoup
url = "https://www1.nyc.gov/events/events-filter.html#page-1"
r=requests.get("https://www1.nyc.gov/events/events-filter.html#page-1")
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
    if "http" in link.get("href"):
        print ("<a href='%s'>%s</a>" %(link.get("href"), link.text))
g_data = soup.find_all("div", {"span class": "address"})
print(g_data)
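But it only shows empty brackets: []
How can I actually get the address data from the website?
For your information, I have also uploaded a screenshot of the page source.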
Solution
Use their JSON API instead of bs4; see the code below.
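Two things probably contribute to the empty list in the question. First, the filter {"span class": "address"} asks for <div> tags that carry an attribute literally named "span class", which no tag has. Second, the "#page-1" fragment of the URL is never sent to the server, and the event list is likely filled in by JavaScript after the page loads, so the HTML that requests downloads may not contain the addresses at all. The sketch below only illustrates the selector syntax; the markup in it is hypothetical and the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup, used only to illustrate selector syntax.
html = """
<div class="event-data-detail">
  <span class="address">Mulberry Street, Little Italy, New York</span>
</div>
"""

# Naming the parser explicitly avoids the UserWarning shown in the question.
soup = BeautifulSoup(html, "html.parser")

# Filter as written in the question: matches <div> tags with an attribute
# literally named "span class", so nothing is found.
wrong = soup.find_all("div", {"span class": "address"})

# Tag name first, then the class filter.
right = soup.find_all("span", class_="address")
addresses = [tag.get_text(strip=True) for tag in right]

print(wrong)      # []
print(addresses)  # ['Mulberry Street, Little Italy, New York']
```

Even with the corrected selector, the live page can still yield an empty list if the events are injected by JavaScript, which is why an API is the more reliable route.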
import requests

# Walk the paginated JSON API; 185 is the page count hard-coded in this answer.
for page in range(1, 186):
    link = 'https://www1.nyc.gov/calendar/api/json/search.htm?&sort=DATE&pageNumber=' + str(page)
    req = requests.get(link)
    # Each page holds a list of events under the 'items' key.
    for item in req.json()['items']:
        print('Address:', item['address'])
Output
Address: Mulberry Street, Little Italy, New York
Address: Various locations Citywide
Address: SECOND AVENUE between EAST 32 STREET and EAST 33 STREET Manhattan
Address: FIRST AVENUE between EAST 92 STREET and EAST 93 STREET Manhattan
Address: CARROLL STREET between SMITH STREET and COURT STREET Brooklyn
Address: BROADWAY between WEST 114 STREET and WEST 116 STREET Manhattan
Address: CORTELYOU ROAD between RUGBY ROAD and ARGYLE ROAD Brooklyn
Address: QUEENS BOULEVARD between 70 AVENUE and 69 ROAD Queens
Address: 79 STREET between NORTHERN BOULEVARD and 34 AVENUE Queens
Address: PRINCE STREET between MOTT STREET and MULBERRY STREET Manhattan
Address: BUSHWICK AVENUE between NOLL STREET and ARION PLACE Brooklyn
Address: Alley Pond Park Adventure Center
Address: Atlantic Avenue between 4th Avenue and Hicks Street
Address: Alexander von Humboldt statue - Central Park West and 77th Street
Address: SEVENTH AVENUE between WEST 110 STREET and WEST 111 STREET Manhattan
Address: Wave Hill House - West 249th Street and Independence Avenue
Address: Broadway between Liberty Street and Rector Street
Address: Anibal Aviles Playground
Address: Myrtle Avenue between Fresh Pond Road and Wyckoff Avenue
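The loop in the answer hard-codes the page count, which will drift as events are added or removed. A minimal sketch of an alternative is shown here, assuming the API returns an empty "items" list once the last page is passed (this is not confirmed by the answer). It uses urllib from the standard library in place of requests so it has no third-party dependency; page_url, fetch_page, iter_addresses and max_pages are illustrative names, not part of the site's API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the answer above.
BASE_URL = "https://www1.nyc.gov/calendar/api/json/search.htm"


def page_url(page_number):
    """Build the search URL for one page of results."""
    return BASE_URL + "?" + urlencode({"sort": "DATE", "pageNumber": page_number})


def fetch_page(page_number):
    """Download one page and decode it as JSON (performs a network call)."""
    with urlopen(page_url(page_number), timeout=10) as resp:
        return json.load(resp)


def iter_addresses(fetch=fetch_page, max_pages=500):
    """Yield non-empty addresses page by page.

    Stops at the first page without items (assumed end-of-data signal)
    or after max_pages, whichever comes first.
    """
    for page_number in range(1, max_pages + 1):
        items = fetch(page_number).get("items") or []
        if not items:
            break
        for item in items:
            address = item.get("address")
            if address:
                yield address
```

To print the results in the same form as the output above, iterate over it: for address in iter_addresses(): print("Address:", address). Passing the fetch function as a parameter keeps the paging logic testable without touching the network.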