python - Parsing a table with no <table>/<tr> tags and data nested in <div> tags - beautifulsoup, selenium and webdriver_manager
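Problem description
I am trying to get all of the table at this url = "https://www.topuniversities.com/university-rankings/university-subject-rankings/2021/psychology". The problem is that there are no <table> tags, and no <tr> or <td> tags either. All the data in the rows is in nested <div> tags. This is the code I am using: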
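from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
import time

url = "https://www.topuniversities.com/university-rankings/university-subject-rankings/2021/psychology"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
driver.maximize_window()
driver.get(url)
time.sleep(5)  # crude wait for the JavaScript-rendered rows
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")
driver.quit()
print(soup)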
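Moreover, I am only getting data from one column (the one named "Overall Score") nested in the <div> tags. Another thing I noticed is that the soup output only contains data for the first 10 rows, but I am trying to get all 302 rows.
I would appreciate any advice you can give me.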
Edit
I managed to get what I expected with @KunduK's answer. This is the code I ended up using:
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3519089_indicators.txt?1614801117').json()
df = pd.DataFrame(res["data"])
df = df[["uni", "region", "location", "city", "overall",
         "ind_69", "ind_70", "ind_76", "ind_77"]]
headers = {"uni": "University", "overall": "Overall Score", "ind_69": "H-index Citations",
           "ind_70": "Citations per Paper", "ind_76": "Academic Reputation", "ind_77": "Employer Reputation"}
df.rename(columns=headers, inplace=True)
# Each renamed column holds an HTML fragment; keep only the text of its <div>
for column in headers.values():
    df[column] = df[column].apply(lambda value: BeautifulSoup(value, 'html.parser').find('div').text)
df
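The final loop in that code is there because each cell in this feed appears to arrive as a small HTML fragment rather than a bare value. As a minimal sketch of that clean-up step, the snippet below strips the markup with the standard library's html.parser, swapped in here for the BeautifulSoup call so it has no third-party dependency. The helper name strip_tags and the sample fragments are made up for illustration and are not taken from the real response.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(str(fragment))
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts).strip()


# Made-up cells in the shape the loop appears to expect.
sample_cells = [
    '<div class="td-wrap-in">98.6</div>',
    '<div><a href="/x">Some University</a></div>',
]
print([strip_tags(cell) for cell in sample_cells])
```

With pandas, something like df[column].map(strip_tags) would apply the same idea column by column.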
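I checked the URL you provided. It seems the data (received from an XHR request @ https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3519089.txt?1616049862?v=1616050007711) is split up by pagination, which is why you only see 10 entries of it.
You have two options for handling this:
- Simulate clicking the next-page button
- Read the full data in JSON format from the XHR URL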
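The first option can be sketched as a loop that reads the visible rows, clicks "next", and stops when there is no further page. The version below is deliberately driver-agnostic: read_rows and click_next are hypothetical callables that you would implement with Selenium (for instance with driver.find_elements(By.CSS_SELECTOR, ...) and element.click()), and the selectors for this particular site are not known here. The fake pager at the bottom exists only to show the loop running.

```python
def collect_all_pages(read_rows, click_next, max_pages=100):
    """Gather rows from a paginated table.

    read_rows()  -> list of rows currently shown (hypothetical callable)
    click_next() -> True if it moved to another page, False on the last page
    max_pages    -> safety cap so a broken "next" button cannot loop forever
    """
    rows = []
    for _ in range(max_pages):
        rows.extend(read_rows())
        if not click_next():
            break
    return rows


class FakePager:
    """Stand-in for a browser: `total` rows shown `page_size` per page."""

    def __init__(self, total=25, page_size=10):
        self.items = list(range(1, total + 1))
        self.page_size = page_size
        self.page = 0

    def read_rows(self):
        start = self.page * self.page_size
        return self.items[start:start + self.page_size]

    def click_next(self):
        if (self.page + 1) * self.page_size >= len(self.items):
            return False  # already on the last page
        self.page += 1
        return True


pager = FakePager()
all_rows = collect_all_pages(pager.read_rows, pager.click_next)
print(len(all_rows))  # 25
```

In a real browser session you would also need to wait for the new rows to render after each click before reading them.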
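Solution
You don't need selenium. If you go to the Network tab, you will find the following link, which returns the data as JSON. You need to iterate over it and get the values.
Code: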
import requests

res = requests.get("https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3519089.txt?1615516693?v=1616064930668").json()
print("Total records :{}".format(len(res['data'])))
for item in res['data']:
    print(item['country'])
    print(item['city'])
    print(item['score'])
    print("============")
Output:
Total records :302
United States
Cambridge
98.6
============
United States
Stanford
96.4
============
United Kingdom
Oxford
95.5
============
United Kingdom
Cambridge
94.8
============
United States
Berkeley
92.3
============
United States
Los Angeles
91.4
============
United States
New Haven
90.9
============
United States
Ann Arbor
89.5
============
United States
Cambridge
89.3
============
United Kingdom
London
89.2
============
United States
Philadelphia
89.2
============
United States
New York City
89.1
============
United States
New York City
88.4
============
United States
Chicago
88.2
============
Netherlands
Amsterdam
87.7
============
Singapore
Singapore
87.2
============
Canada
Vancouver
87.2
============
United States
Princeton
87
============
Canada
Toronto
86.1
============
United Kingdom
London
85.7
============
Australia
Parkville
85.7
============
United States
Evanston
85.5
============
Belgium
Leuven
85.2
============
United Kingdom
London
85.1
============
Australia
Sydney
85.1
============
Australia
Brisbane
84.4
============
Singapore
Singapore
84.3
============
United States
Durham
83.6
============
Canada
Montreal
83.5
============
Australia
Sydney
83.4
============
Netherlands
Utrecht
82.9
============
United States
Champaign
82.7
============
United Kingdom
Edinburgh
82.5
============
United Kingdom
Manchester
81.7
============
Hong Kong SAR
Hong Kong
81.7
============
United States
Austin
81.6
============
United States
Pittsburgh
81.5
============
Australia
Canberra
81.3
============
Netherlands
Rotterdam
81.2
============
United States
East Lansing
81.1
============
Germany
Berlin
81
============
Australia
Perth
81
============
Germany
Berlin
80.9
============
Netherlands
Groningen
80.9
============
United States
Ithaca
80.7
============
Hong Kong SAR
Hong Kong
80.4
============
United States
Madison
80.4
============
United States
Columbus
80.3
============
Switzerland
Zürich
80.3
============
United States
San Diego
80.2
============
Australia
Melbourne
80.1
============
Netherlands
Leiden
79.8
============
United States
Seattle
79.8
============
Netherlands
Tilburg
79.6
============
United States
Minneapolis
79.5
============
China (Mainland)
Beijing
79.4
============
New Zealand
Auckland
79.3
============
Netherlands
Maastricht
79.1
============
United States
University Park
79.1
============
United States
Chapel Hill
79.1
============
Belgium
Louvain-la-Neuve
78.9
============
Netherlands
Nijmegen
78.5
============
United Kingdom
Coventry
78.5
============
United States
Nashville
78.5
============
Netherlands
Amsterdam
78.5
============
United States
Baltimore
78.4
============
United Kingdom
Exeter
78.3
============
United States
College Park
78.3
============
United Kingdom
Cardiff
78.2
============
Germany
Munich
78.2
============
Chile
Santiago
78.1
============
New Zealand
Kelburn, Wellington
78.1
============
United States
Providence
78
============
Australia
Sydney
77.8
============
Belgium
Ghent
77.8
============
United States
Boston
77.3
============
United States
Los Angeles
77.3
============
Japan
Tokyo
77.1
============
United Kingdom
Birmingham
77.1
============
United Kingdom
Bristol
77
============
New Zealand
Dunedin
77
============
China (Mainland)
Beijing
76.9
============
Italy
Rome
76.9
============
Italy
Padua
76.9
============
United States
Charlottesville
76.9
============
Sweden
Stockholm
76.8
============
Spain
Madrid
76.8
============
United Kingdom
York
76.8
============
United States
Phoenix
76.6
============
Denmark
Aarhus
76.5
============
... and so on.
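For reuse, the answer's loop can be split so that parsing is separate from downloading. This is a sketch under assumptions: it relies only on the keys the answer already prints (data, country, city, score), the function names are invented here, and it uses the standard library's urllib in place of requests. Whether score arrives as a string or a number is not known, so it is coerced with float(), which would fail on an empty value. The feed URL may have changed since the answer was written, so the download function is defined but not called.

```python
import json
from urllib.request import urlopen


def extract_rows(payload):
    """Turn the decoded JSON into (country, city, score) tuples.

    Assumes payload["data"] is a list of dicts with the keys
    printed in the answer; score is coerced to float.
    """
    return [
        (item["country"], item["city"], float(item["score"]))
        for item in payload["data"]
    ]


def fetch_payload(url, timeout=10):
    """Download and decode the JSON feed (not called below)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Sample shaped like two of the records in the output above.
sample = {"data": [
    {"country": "United States", "city": "Cambridge", "score": "98.6"},
    {"country": "United States", "city": "Princeton", "score": 87},
]}
for country, city, score in extract_rows(sample):
    print(country, city, score)
```

Keeping extract_rows free of network calls means it can be checked against a saved copy of the response.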