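python - Scraping greatschools.org with BeautifulSoup returns an empty list
Question
I have been learning how to scrape the greatschools.org website with BeautifulSoup. I have looked at various solutions here and elsewhere, but I have hit a dead end. Using the "Inspect" feature in Chrome, I can see that the page contains table tags, yet find_all('tr'), find_all('table'), and find_all('tbody') each return an empty list. What am I missing?
This is the code I am using: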
import requests
from bs4 import BeautifulSoup

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
page_response = requests.get(url)
content = BeautifulSoup(page_response.text, "html.parser")
table = content.find_all('table')
print(table)
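The output is: []
Thanks in advance for your help.
Answer
You can use Selenium, because the page appears to be dynamic: requests only downloads the initial HTML and does not run the JavaScript that builds the table, which is why find_all() finds nothing. You can still use BeautifulSoup for the parsing if you like. When the tags in question are tables, I prefer to let pandas read the HTML. (You will still have to do some work to split up the text and columns, especially the first column, where several values run together.)
Let me know if this works for you.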
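from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
# Raw string so the backslashes in the Windows path are not read as escapes.
# Selenium 4 takes the driver path through a Service object; Selenium 3
# accepted it as the first positional argument of webdriver.Chrome().
driver = webdriver.Chrome(service=Service(r'C:\chromedriver_win32\chromedriver.exe'))
driver.get(url)
html = driver.page_source  # HTML after the browser has rendered the page
driver.quit()              # end the browser session

# Recent pandas versions expect a file-like object rather than a raw string.
table = pd.read_html(StringIO(html))
df = table[0]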
Output
print(table[0])
School ... District
0 9/10Above averageSouthern Lehigh Intermediate ... ... Southern Lehigh School District
1 8/10Above averageHanover El School3890 Jackson... ... Bethlehem Area School District
2 8/10Above averageLehigh Valley Charter High Sc... ... Lehigh Valley Charter High School For The Arts
3 6/10AverageCalypso El School1021 Calypso Ave, ... ... Bethlehem Area School District
4 6/10AverageMiller Heights El School3605 Allen ... ... Bethlehem Area School District
5 6/10AverageAsa Packer El School1650 Kenwood Dr... ... Bethlehem Area School District
6 6/10AverageLehigh Valley Academy Regional Cs15... ... Lehigh Valley Academy Regional Cs
7 5/10AverageNortheast Middle School1170 Fernwoo... ... Bethlehem Area School District
8 5/10AverageNitschmann Middle School1002 West U... ... Bethlehem Area School District
9 5/10AverageThomas Jefferson El School404 East ... ... Bethlehem Area School District
10 4/10Below averageJames Buchanan El School1621 ... ... Bethlehem Area School District
11 4/10Below averageLincoln El School1260 Gresham... ... Bethlehem Area School District
12 4/10Below averageGovernor Wolf El School1920 B... ... Bethlehem Area School District
13 4/10Below averageSpring Garden El School901 No... ... Bethlehem Area School District
14 4/10Below averageClearview El School2121 Abing... ... Bethlehem Area School District
15 4/10Below averageLiberty High School1115 Linde... ... Bethlehem Area School District
16 4/10Below averageEast Hills Middle School2005 ... ... Bethlehem Area School District
17 4/10Below averageFreedom High School3149 Chest... ... Bethlehem Area School District
18 3/10Below averageMarvine El School1425 Livings... ... Bethlehem Area School District
19 3/10Below averageWilliam Penn El School1002 Ma... ... Bethlehem Area School District
20 3/10Below averageLehigh Valley Dual Language C... ... Lehigh Valley Dual Language Charter School
21 2/10Below averageBroughal Middle School114 Wes... ... Bethlehem Area School District
22 2/10Below averageDonegan El School1210 East 4t... ... Bethlehem Area School District
23 2/10Below averageFountain Hill El School1330 C... ... Bethlehem Area School District
24 Currently unratedSt. Anne School375 Hickory St... ... NaN
[25 rows x 7 columns]
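The first column in the output above fuses the rating, the rating label, and the school name and address into one string. A minimal sketch of how that cell might be split is shown here. The helper name split_school_cell is hypothetical, and the list of labels is an assumption taken only from the sample output; the site may use others. Separating the school name from the address is left out, because nothing in the fused text marks that boundary reliably.

```python
import re

# Labels assumed from the sample output above; extend the list if others appear.
_CELL = re.compile(
    r"^(?:(?P<rating>\d+)/10)?"
    r"(?P<label>Above average|Average|Below average|Currently unrated)"
    r"(?P<rest>.*)$"
)

def split_school_cell(cell):
    """Split a fused cell into (rating, label, remaining text).

    rating is an int, or None when the school is unrated.
    Cells that do not match the assumed layout come back as (None, None, cell).
    """
    match = _CELL.match(cell)
    if match is None:
        return None, None, cell
    rating = int(match.group("rating")) if match.group("rating") else None
    return rating, match.group("label"), match.group("rest").strip()

print(split_school_cell("6/10AverageCalypso El School1021 Calypso Ave, ..."))
print(split_school_cell("Currently unratedSt. Anne School375 Hickory St..."))
```

With the DataFrame from the answer, the helper could be applied to the first column, for example with df['School'].apply(split_school_cell), assuming the column is named School as in the printed header.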
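Now, if you would still rather use BeautifulSoup, perhaps because you also want to pull out some of the links or other tags inside the table (maybe the table alone is not enough for your needs?), you can do that once you have page_response from Selenium: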
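from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.greatschools.org/pennsylvania/bethlehem/schools/?tableView=Overview&view=table"
# Raw string so the backslashes in the Windows path are not read as escapes.
# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(r'C:\chromedriver_win32\chromedriver.exe'))
driver.get(url)
page_response = driver.page_source  # rendered HTML, including the table
driver.quit()                       # end the browser session

content = BeautifulSoup(page_response, 'html.parser')
table = content.find_all('table')
print(table)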
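As an illustration of pulling links out of the table with BeautifulSoup, here is a small sketch. The function name extract_table_links and the sample markup are hypothetical; the markup stands in for driver.page_source and is not the real structure of the greatschools.org page.

```python
from bs4 import BeautifulSoup

def extract_table_links(html):
    """Return (link text, href) pairs for every link found inside a <table>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for table in soup.find_all("table"):
        # href=True skips anchors that have no href attribute
        for anchor in table.find_all("a", href=True):
            links.append((anchor.get_text(strip=True), anchor["href"]))
    return links

# Hypothetical markup standing in for driver.page_source
sample_html = """
<table>
  <tr><td><a href="/pennsylvania/bethlehem/example-school/">Example School</a></td></tr>
  <tr><td>No link here</td></tr>
</table>
<a href="/outside/">Outside the table</a>
"""

print(extract_table_links(sample_html))
```

Only the link inside the table is returned; the anchor outside it is ignored.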