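python - Scraping content from a page whose links are javascript:void()
Problem description
I want to scrape the first ten pages of https://www.gotouniversity.com/course/index. So far I have only been able to get the content of the first page.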
from selenium import webdriver
driver = webdriver.Chrome(executable_path='/Users/xx/Desktop/chromedriver')
driver.get('https://www.gotouniversity.com/course/index')
university_name = driver.find_elements_by_class_name("university-name")
university_name = [link.text for link in university_name]
print(university_name)
['Loyola University Chicago',
'Queens University',
...
'Yale University']
The pagination links are javascript:void(), so I don't know how to get the content of each page in turn.
<div class="pagination"><div aria-live="polite" role="status" style="float:left; height:14px; padding:8px">Showing 1 to 20 of 143981 entries</div><div style="float:right;"><ul class="pagination" id="pagin_count"><li class="active" p="1"><a>1</a></li><li p="2"><a href="javascript:void()" onclick="pagingcustom(2);">2</a></li><li p="3"><a href="javascript:void()" onclick="pagingcustom(3);">3</a></li><li p="4"><a href="javascript:void()" onclick="pagingcustom(4);">4</a></li><li p="5"><a href="javascript:void()" onclick="pagingcustom(5);">5</a></li><li p="6"><a href="javascript:void()" onclick="pagingcustom(6);">6</a></li><li p="7"><a href="javascript:void()" onclick="pagingcustom(7);">7</a></li><li p="8"><a href="javascript:void()" onclick="pagingcustom(8);">8</a></li><li p="9"><a href="javascript:void()" onclick="pagingcustom(9);">9</a></li><li p="10"><a href="javascript:void()" onclick="pagingcustom(10);">10</a></li><li p="1"><a href="javascript:void()" onclick="pagingcustom(1);">Next</a></li></ul></div></div>
</div>
<script>
function fn_advcount(id){
$.ajax({
url: 'https://www.gotouniversity.com/site/advertisement-count',
data: { id : id },
success: function(result){
}});
}
</script>
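In the markup above, the href of each pagination link is a no-op, and the target page number sits in the onclick handler instead (pagingcustom(2), pagingcustom(3), and so on). A minimal sketch that pulls those numbers out with the standard library is shown here. PAGINATION_HTML is a shortened copy of the markup above, and pagination_targets is a hypothetical helper name, not something the site provides. What pagingcustom() actually requests is not shown in the question; the browser's network tab would reveal that.

```python
import re

# Shortened copy of the pagination markup shown above (only three links kept).
PAGINATION_HTML = (
    '<ul class="pagination" id="pagin_count">'
    '<li class="active" p="1"><a>1</a></li>'
    '<li p="2"><a href="javascript:void()" onclick="pagingcustom(2);">2</a></li>'
    '<li p="3"><a href="javascript:void()" onclick="pagingcustom(3);">3</a></li>'
    '<li p="10"><a href="javascript:void()" onclick="pagingcustom(10);">10</a></li>'
    '</ul>'
)

# Matches onclick="pagingcustom(N);"> followed by the link text.
LINK_RE = re.compile(r'onclick="pagingcustom\((\d+)\);">([^<]*)</a>')


def pagination_targets(html):
    """Return (link text, page number) pairs for every pagingcustom() link."""
    return [(text, int(page)) for page, text in LINK_RE.findall(html)]


targets = pagination_targets(PAGINATION_HTML)
print(targets)
```

The active page has no onclick handler, so it does not appear in the result.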
The relevant content I want to get:
<a href="/university/loyola-university-chicago" target="_blank" title="University">
<p class="university-name" title="Loyola University Chicago">Loyola University Chicago</p>
</a>
I have read some related questions, but I still can't find a solution.
I also tried bs4, and it can only scrape the content of the first page:
import bs4
import requests
bowl = requests.get('https://www.gotouniversity.com/course/index')
soup = bs4.BeautifulSoup(bowl.text, 'html.parser')
UniversityName = [i.text for i in soup.find_all('p', attrs={'class': 'university-name'})]
Solution
Using BeautifulSoup, this prints the university names and links for the first 10 pages:
import requests
from bs4 import BeautifulSoup

url = 'https://www.gotouniversity.com/course/index'
params = {'page': 1}

for page in range(1, 11):
    print('Page no.{}...'.format(page))
    print('-' * 120)
    print()

    # The page number is sent as POST form data.
    params['page'] = page
    soup = BeautifulSoup(requests.post(url, data=params).text, 'html.parser')

    # Each result is wrapped in an <a title="University"> link.
    for a in soup.select('a[title="University"]'):
        print('{: <60}{}'.format(a.get_text(strip=True), a['href']))
    print()
Prints:
Page no.1...
------------------------------------------------------------------------------------------------------------------------
Loyola University Chicago /university/loyola-university-chicago
Queens University /university/queens-university
University of Wollongong /university/university-of-wollongong
Nanyang Technological University /university/nanyang-technological-university
Kaunas University of Technology /university/kaunas-university-of-technology
University of Bristol /university/university-of-bristol
University of Victoria /university/university-of-victoria
National University of Singapore NUS /university/national-university-of-singapore-nus
Duke University /university/duke-university
Queens University /university/queens-university
New Jersey Institute of Technology /university/new-jersey-institute-of-technology
Swinburne University of Technology /university/swinburne-university-of-technology
University of Alberta /university/university-of-alberta
Cardiff University /university/cardiff-university
St Clair College /university/st-clair-college
Stanford University /university/stanford-university
McGill University /university/mcgill-university
Arizona State University Tempe /university/arizona-state-university-tempe
University of North Carolina Greensboro /university/university-of-north-carolina-greensboro
Yale University /university/yale-university
Page no.2...
------------------------------------------------------------------------------------------------------------------------
Cambrian College /university/cambrian-college
Simon Fraser University Burnaby /university/simon-fraser-university-burnaby
University of Bologna /university/university-of-bologna
Memorial University of Newfoundland /university/memorial-university-of-newfoundland
Centennial College /university/centennial-college
University of Groningen /university/university-of-groningen
Griffith University Gold Coast Campus /university/griffith-university-gold-coast-campus
Texas A and M University College Station /university/texas-a-and-m-university-college-station
University of Calgary /university/university-of-calgary
University of Melbourne /university/university-of-melbourne
Fanshawe College /university/fanshawe-college
Zurich Swiss Federal Institute of Technology ETH /university/zurich-swiss-federal-institute-of-technology-eth
Northeastern University /university/northeastern-university
Adelphi University /university/adelphi-university
Heriot Watt University Dubai /university/heriot-watt-university-dubai
University of Ottawa /university/university-of-ottawa
University of Regina /university/university-of-regina
University of Regina /university/university-of-regina
Humber College North Campus /university/humber-college-north-campus
Seneca College /university/seneca-college
...and so on.
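For reuse, it can help to keep the parsing step separate from the HTTP request so that it can be checked offline. The following is only a sketch under stated assumptions: it uses the standard library's html.parser instead of BeautifulSoup, and UniversityLinkParser, parse_universities, scrape_pages, SAMPLE, and fake_fetch are hypothetical names introduced for illustration. SAMPLE reuses the HTML fragment from the question.

```python
from html.parser import HTMLParser


class UniversityLinkParser(HTMLParser):
    """Collect (name, href) for every <a title="University"> element."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._inside = False   # True while inside a matching <a>
        self._href = None      # href of that <a>
        self._text = []        # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('title') == 'University':
            self._inside = True
            self._href = attrs.get('href')
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._inside:
            name = ''.join(self._text).strip()
            self.results.append((name, self._href))
            self._inside = False


def parse_universities(html):
    """Return a list of (university name, relative link) from one page."""
    parser = UniversityLinkParser()
    parser.feed(html)
    parser.close()
    return parser.results


def scrape_pages(fetch, first=1, last=10):
    """Call fetch(page) for each page number; stop early on an empty page."""
    pages = {}
    for page in range(first, last + 1):
        rows = parse_universities(fetch(page))
        if not rows:
            break
        pages[page] = rows
    return pages


# Offline demo: the fragment from the question stands in for a real response.
SAMPLE = '''<a href="/university/loyola-university-chicago" target="_blank" title="University">
<p class="university-name" title="Loyola University Chicago">Loyola University Chicago</p>
</a>'''


def fake_fetch(page):
    # Pretend only two pages exist.
    return SAMPLE if page <= 2 else ''


result = scrape_pages(fake_fetch)
print(result)
```

Against the live site, fetch could be a function that returns requests.post(url, data={'page': page}).text, as in the answer's code above. Passing the fetcher in as an argument keeps the parsing logic testable without network access.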