python - Fetch book titles and PDF links from a URL with Python
Problem description
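I am trying to get the book titles and the links to the books embedded in a web page. The page's HTML source looks like the excerpt below; I have extracted only a small part of it for illustration.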
<section>
<div class="book row" isbn-data="1601982941">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>Learning Deep Architectures for AI</h2>
<span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
<div class="meta-auth-ttl"></div>
<p>Foundations and Trends(r) in Machine Learning.</p>
<div>
<a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
<section>
<div class="book row" isbn-data="1496034023">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<span class="meta-auth"><b>Roberto Battiti & Mauro Brunato, 2013</b></span>
<div class="meta-auth-ttl"></div>
<p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
<div>
<a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
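I tried the code below. It only prints the book titles, and they are still wrapped in their <h2> tags. I would like to print the plain book name together with the book's PDF link.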
#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq
web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()
soup = bs(web_res, 'html.parser')
headers = soup.find_all(['h2'])
print(*headers, sep='\n')
#divs = soup.find_all('div')
#print(*divs, sep="\n\n")
header_1 = soup.find_all('h2', class_='book-container')
print(header_1)
Output:
<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>
Expected output:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
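Please help me understand how to achieve this. I have searched, but with my limited knowledge I could not work it out. The HTML source contains a lot of div elements and class attributes, so I am confused about which class to select to get the href and the h2.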
Solution
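The HTML is very nicely structured, and you can make use of that here. The site evidently uses Bootstrap as its style scaffolding, so you can pretty much ignore the row and col-[size]-[gridcount] classes.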
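You basically have:
- a <div class="book"> per book
  - one column, with
    - a <div class="book-cats"> category and
    - an image
  - a second column, with
    - a <div class="star-ratings"> ratings block
    - an <h2> book title
    - a <span class="meta-auth"> author line
    - a <p> book description
    - two links, <a class="btn" ...>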
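Most of that can be ignored. Both the title and the link you want are the first element of their type inside the book div, so you could use element.nested_element to grab either one.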
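As a minimal sketch of that attribute-style navigation, the snippet below parses a trimmed copy of one book entry from the question and reads the first h2 and the first a through book.h2 and book.a. The inline html string and the variable names are only for illustration; the real page would be fetched as in the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of one book entry from the question's HTML.
html = """
<div class="book row" isbn-data="1601982941">
  <div class="col-lg-6">
    <h2>Learning Deep Architectures for AI</h2>
    <div>
      <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
      <a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
book = soup.find("div", class_="book")

# book.h2 is shorthand for book.find("h2"): the first <h2> descendant.
first_title = book.h2.get_text(strip=True)
# book.a is likewise the first <a> descendant; index it to read an attribute.
first_link = book.a["href"]

print(first_title)
print(first_link)
```

Note that book.h2 evaluates to None when no such element exists, so this shorthand is only safe when every book div is known to be complete.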
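So all you have to do is:
- Loop over all the book divs.
- For every such div, take the first h2 and the first a element.
- For the title, take the text contained in the h2.
- For the link, take the href attribute of the a anchor.
Like this: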
for book in soup.select("div.book:has(h2):has(a.btn[href])"):
    title = book.h2.get_text(strip=True)
    link = book.select_one("a.btn[href]")["href"]
    # store or process title and link
    print("Title:", title)
    print("Link:", link)
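I used .select_one() with a CSS selector to be a bit more specific about which link element to accept; .btn specifies the class, and [href] requires that an href attribute is present. I also tightened the book search by limiting it to divs that have both a title and at least one link; the :has(...) selector limits matches to elements that contain specific descendant elements.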
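To see what the :has(...) filters buy you, here is a small sketch that runs the same selector against four made-up book divs, only one of which has both a title and a usable link. The HTML and the example.com URLs are invented for this demonstration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Four hypothetical entries: only the first has both an <h2> and an
# <a class="btn"> that carries an href attribute.
html = """
<div class="book row"><h2>Complete entry</h2>
  <a class="btn" href="http://example.com/book.pdf">View Free Book</a></div>
<div class="book row"><h2>Title but no link</h2></div>
<div class="book row"><a class="btn" href="http://example.com/other.pdf">View Free Book</a></div>
<div class="book row"><h2>Anchor without href</h2><a class="btn">View Free Book</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Without the filters every book div matches, including incomplete ones.
all_books = soup.select("div.book")
# With the filters only divs containing both required elements match.
complete_books = soup.select("div.book:has(h2):has(a.btn[href])")

titles = [book.h2.get_text(strip=True) for book in complete_books]
print(len(all_books), len(complete_books))
print(titles)
```

Filtering in the selector means the loop body never has to guard against a missing h2 or a missing href.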
The above produces:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Title: Learning Deep Architectures for AI
Link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Title: The LION Way: Machine Learning plus Intelligent Optimization
Link: http://www.e-booksdirectory.com/details.php?ebook=9575
... etc ...
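If you would rather not rely on :has() support in the CSS selector engine, the same steps can be sketched with find_all() and find() plus explicit None checks. This is an alternative under that assumption, not part of the approach above; the extract_books helper and the sample string are hypothetical names introduced here, and the sample is trimmed from the HTML in the question.

```python
from bs4 import BeautifulSoup


def extract_books(html):
    """Return (title, link) pairs for every book div that has both."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # class_="book" matches the class token "book", not "book-cats".
    for book in soup.find_all("div", class_="book"):
        heading = book.find("h2")
        # href=True only matches anchors that actually carry an href.
        anchor = book.find("a", class_="btn", href=True)
        if heading is None or anchor is None:
            continue  # skip incomplete entries instead of raising
        results.append((heading.get_text(strip=True), anchor["href"]))
    return results


# Two entries trimmed from the question's HTML.
sample = """
<section><div class="book row" isbn-data="1601982941">
  <div class="col-lg-3"><div class="book-cats">Artificial Intelligence</div></div>
  <div class="col-lg-6"><h2>Learning Deep Architectures for AI</h2>
    <div><a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
    <a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a></div></div>
</div></section>
<section><div class="book row" isbn-data="1496034023">
  <div class="col-lg-6"><h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
    <div><a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
    <a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a></div></div>
</div></section>
"""

books = extract_books(sample)
for title, link in books:
    print("Title:", title)
    print("Link:", link)
```

Returning a list of pairs rather than printing inside the loop keeps the parsing separate from whatever you do with the results, such as writing them to a file.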