Data parsed from a website ends up empty (bs4, python, lxml)

Problem description

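Hello Stack Overflow people,

I am having trouble parsing information from a website with BeautifulSoup and lxml.

I am trying to get the address data from the page "https://www1.nyc.gov/events/events-filter.html#page-1".

From what I found on Google, I need to:

1. Find the specific class that holds the information, using the browser's "Inspect" tool.
2. Write code along the lines of g_data = soup.find_all("div", {"class": "event-data-detail"})

So I wrote my code as follows.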

import requests
from bs4 import BeautifulSoup

url = "https://www1.nyc.gov/events/events-filter.html#page-1"
r=requests.get("https://www1.nyc.gov/events/events-filter.html#page-1")

soup = BeautifulSoup(r.content)


links = soup.find_all("a")

g_data = soup.find_all("div", {"class": "event-data-detail"})

print(g_data)

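It showed the following message:

Warning (from warnings module):
  File "C:/Users/jotna/Desktop/Portfolio/1.py", line 7
    soup = BeautifulSoup(r.content)
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 7 of the file C:/Users/jotna/Desktop/Portfolio/1.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

So I changed the code as follows. (A post on Stack Overflow suggested adding the lxml code at the end.)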

import lxml
import requests
from bs4 import BeautifulSoup

url = "https://www1.nyc.gov/events/events-filter.html#page-1"
r=requests.get("https://www1.nyc.gov/events/events-filter.html#page-1")

soup = BeautifulSoup(r.content)


links = soup.find_all("a")

for link in links:
   if "http" in link.get("href"):
       print ("<a href='%s'>%s</a>" %(link.get("href"), link.text))

g_data = soup.find_all("div", {"span class": "address"})

print(g_data)

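But it only prints empty brackets: []

How can I actually get the address data from the website?

For reference, I have also uploaded a screenshot of the page source.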

Tags: beautifulsoup, lxml

Solution


Use their JSON API instead of bs4; see the code below.

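import requests

# JSON endpoint that serves the event listings, one page per request.
base_url = 'https://www1.nyc.gov/calendar/api/json/search.htm?&sort=DATE&pageNumber='

# The page count (185) is hard-coded; pages are numbered from 1.
for page in range(1, 186):
    req = requests.get(base_url + str(page))
    # Each page carries its events in the 'items' list.
    for item in req.json()['items']:
        print('Address:', item['address'])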

Output

Address: Mulberry Street, Little Italy, New York
Address: Various locations Citywide
Address:  SECOND AVENUE between EAST   32 STREET and EAST   33 STREET  Manhattan
Address:  FIRST AVENUE between EAST   92 STREET and EAST   93 STREET  Manhattan
Address:  CARROLL STREET between SMITH STREET and COURT STREET  Brooklyn
Address:  BROADWAY between WEST  114 STREET and WEST  116 STREET  Manhattan
Address:  CORTELYOU ROAD between RUGBY ROAD and ARGYLE ROAD  Brooklyn
Address:  QUEENS BOULEVARD between 70 AVENUE and 69 ROAD  Queens
Address:  79 STREET between NORTHERN BOULEVARD and 34 AVENUE  Queens
Address:  PRINCE STREET between MOTT STREET and MULBERRY STREET  Manhattan
Address:  BUSHWICK AVENUE between NOLL STREET and ARION PLACE  Brooklyn
Address: Alley Pond Park Adventure Center
Address: Atlantic Avenue between 4th Avenue and Hicks Street
Address: Alexander von Humboldt statue - Central Park West and 77th Street
Address:  SEVENTH AVENUE between WEST  110 STREET and WEST  111 STREET  Manhattan
Address: Wave Hill House - West 249th Street and Independence Avenue
Address: Broadway between Liberty Street and Rector Street
Address: Anibal Aviles Playground
Address: Myrtle Avenue between Fresh Pond Road and Wyckoff Avenue
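
The loop above hard-codes 185 pages and prints the addresses exactly as the API returns them. Below is a minimal sketch of a slightly more defensive variant, not a definitive implementation. The helper names extract_addresses and iter_addresses and the max_pages parameter are made up for illustration. It also assumes that a page with an empty 'items' list marks the end of the results and that some items may lack an 'address' value; neither is confirmed by the API.

```python
def extract_addresses(payload):
    """Return the cleaned, non-empty 'address' values from one page of JSON."""
    addresses = []
    for item in payload.get("items", []):
        # Collapse runs of whitespace such as "EAST   32 STREET".
        address = " ".join((item.get("address") or "").split())
        if address:
            addresses.append(address)
    return addresses


def iter_addresses(base_url, max_pages=185):
    """Yield addresses page by page, stopping at the first page with no items."""
    import requests  # imported here so extract_addresses works without it

    for page in range(1, max_pages + 1):
        response = requests.get(base_url + str(page), timeout=10)
        response.raise_for_status()
        payload = response.json()
        if not payload.get("items"):
            break  # assumption: an empty page means there is nothing further
        yield from extract_addresses(payload)


# Offline check with a hand-written payload shaped like the API response.
sample_page = {
    "items": [
        {"address": " SECOND AVENUE between EAST   32 STREET and EAST   33 STREET  Manhattan"},
        {"address": None},
        {"name": "an item without an address"},
        {"address": "Anibal Aviles Playground"},
    ]
}
cleaned = extract_addresses(sample_page)
for address in cleaned:
    print("Address:", address)
```

Passing the same base URL as in the answer to iter_addresses should yield the addresses listed above with the extra spaces collapsed, provided the assumptions about the response hold.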

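As for why the bs4 attempt in the question printed [], two things probably worked against it. First, the "#page-1" part of the URL is a fragment, which the browser handles and which is never sent to the server; the listings are most likely filled in by JavaScript afterwards, and requests does not run JavaScript. Second, {"span class": "address"} asks for a div with an attribute literally named "span class", which matches nothing. The sketch below illustrates both points on a made-up HTML snippet (not copied from the site) and passes the parser explicitly, which also avoids the UserWarning.

```python
from urllib.parse import urldefrag

from bs4 import BeautifulSoup

# The fragment is split off on the client side and is not part of the request.
url = "https://www1.nyc.gov/events/events-filter.html#page-1"
base, fragment = urldefrag(url)
print(base)
print(fragment)

# Hypothetical static markup, written for illustration only.
html = """
<div class="event-data-detail">
  <span class="address">Mulberry Street, Little Italy, New York</span>
</div>
"""

# Naming the parser explicitly silences the "No parser was explicitly
# specified" warning; "html.parser" ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Searches <div> tags for an attribute called "span class": never matches.
wrong = soup.find_all("div", {"span class": "address"})

# Tag name first, then the attribute filter.
right = soup.find_all("span", {"class": "address"})
addresses = [tag.get_text(strip=True) for tag in right]

print(wrong)
print(addresses)
```

If lxml is installed, "lxml" can be passed in place of "html.parser", as the warning suggests. The corrected selector only helps when the markup is actually present in the downloaded HTML, which is why the JSON API is the more practical route for this page.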