Python: fetch titles and PDF links from a URL

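Problem description

I am trying to get the book titles and the links to the embedded books from a URL. The HTML source of that page looks like the following; I have extracted only a small portion of it to understand the structure.

The page itself is the one fetched in the code further down; a small part of its HTML source follows: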

<section>
  <div class="book row" isbn-data="1601982941">
    <div class="col-lg-3">
      <div class="book-cats">Artificial Intelligence</div>
      <div style="width:100%;">
        <img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
      </div>
    </div>
    <div class="col-lg-6">
      <div class="star-ratings"></div>
      <h2>Learning Deep Architectures for AI</h2>
      <span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
      <div class="meta-auth-ttl"></div>
      <p>Foundations and Trends(r) in Machine Learning.</p>
      <div>
        <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
        <a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
      </div>
    </div>
  </div>
</section>
<section>
  <div class="book row" isbn-data="1496034023">
    <div class="col-lg-3">
      <div class="book-cats">Artificial Intelligence</div>
      <div style="width:100%;">
        <img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
      </div>
    </div>
    <div class="col-lg-6">
      <div class="star-ratings"></div>
      <h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
      <span class="meta-auth"><b>Roberto Battiti &amp; Mauro Brunato, 2013</b></span>
      <div class="meta-auth-ttl"></div>
      <p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
      <div>
        <a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
        <a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
      </div>
    </div>
  </div>
</section>

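I tried the code below:

This code only gets the book titles, and they are still printed with the <h2> tags around them. I also expect to print the book name and the book's PDF link.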

#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq


web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()

soup = bs(web_res, 'html.parser')

headers = soup.find_all(['h2'])
print(*headers, sep='\n')

#divs = soup.find_all('div')
#print(*divs, sep="\n\n")

header_1 = soup.find_all('h2', class_='book-container')
print(header_1)

Output:

<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>

Expected output:

Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf

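Please help me understand how to achieve this. I have searched, but for lack of knowledge I could not work it out. When I look at the HTML source there are a lot of div elements and classes, so I am confused about which class to pick to get the href and the h2.

Tags: python, html, web-scraping, beautifulsoup, urllib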

Solution


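The HTML is pretty well structured, and you can make use of that here. The site evidently uses Bootstrap as its style scaffolding (you can pretty much ignore the row and col-[size]-[gridcount] classes).

You basically have:

  • One <div class="book"> per book
    • A first column with
      • a <div class="book-cats"> category and
      • an image
    • A second column with
      • a <div class="star-ratings"> ratings block
      • an <h2> book title
      • a <span class="meta-auth"> author line
      • a <p> book description
      • two links, <a class="btn" ...>

Most of this can be ignored. The title and the link you want are both the first element of their type, so you can use element.nested_element to grab either.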
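As a minimal sketch of that attribute-style navigation, the snippet below parses a trimmed copy of one book entry from the question (the trimming and the variable names are mine, for illustration only). In BeautifulSoup, `book.h2` behaves like `book.find("h2")` and returns the first matching descendant.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one book entry from the question's HTML.
html = """
<div class="book row" isbn-data="1601982941">
  <div class="col-lg-6">
    <h2>Learning Deep Architectures for AI</h2>
    <div>
      <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
      <a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
book = soup.find("div", class_="book")

# book.h2 and book.a each return the first descendant with that tag name.
title = book.h2.get_text(strip=True)
link = book.a["href"]

print("Title:", title)
print("Link:", link)
```

Because the "View Free Book" anchor comes before "See Reviews" in each entry, the first `a` element is the one that carries the book link.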

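So all you have to do is:

  • Loop over all the book divs.
  • For each such div, take the first h2 and the first a element.
  • For the title, take the text contained in the h2.
  • For the link, take the href attribute of the a anchor element.

Like this: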

for book in soup.select("div.book:has(h2):has(a.btn[href])"):
    title = book.h2.get_text(strip=True)
    link = book.select_one("a.btn[href]")["href"]
    # store or process title and link
    print("Title:", title)
    print("Link:", link)

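I used .select_one() with a CSS selector to be a bit more specific about which link element to accept; .btn specifies the class, and [href] requires that an href attribute be present.

I also tightened the book search by limiting it to divs that have both a title and at least one link; the :has(...) selector restricts matches to elements that contain specific descendant elements.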
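To see what the `:has(...)` filter does, here is a small self-contained sketch. The first entry is trimmed from the question's HTML; the second entry (a title with no link) is a hypothetical one I made up to show a div being skipped, and is not taken from the real page.

```python
from bs4 import BeautifulSoup

html = """
<section>
  <div class="book row" isbn-data="1601982941">
    <h2>Learning Deep Architectures for AI</h2>
    <div>
      <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf">View Free Book</a>
      <a class="btn" href="http://amzn.to/1WePh0N">See Reviews</a>
    </div>
  </div>
</section>
<section>
  <!-- Hypothetical entry for illustration: a title but no link. -->
  <div class="book row">
    <h2>Placeholder Without Link</h2>
  </div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# Every book div, regardless of contents.
all_books = soup.select("div.book")
# Only book divs containing both an h2 and an a.btn with an href attribute.
complete_books = soup.select("div.book:has(h2):has(a.btn[href])")

results = [
    (book.h2.get_text(strip=True), book.select_one("a.btn[href]")["href"])
    for book in complete_books
]

print(len(all_books), len(complete_books))
print(results)
```

Filtering in the selector means the loop body never has to guard against `book.h2` or `select_one(...)` returning `None`.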

The above produces:

Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Title: Learning Deep Architectures for AI
Link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Title: The LION Way: Machine Learning plus Intelligent Optimization
Link: http://www.e-booksdirectory.com/details.php?ebook=9575
... etc ...
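For comparison with the attempt in the question, the following sketch (using a one-line sample of my own, trimmed from the question's HTML) shows why the tags were printed and why the second search found nothing: printing a Tag renders its markup, `.get_text()` returns only the text, and the `h2` elements in the sample carry no class, so filtering on `class_='book-container'` matches none of them.

```python
from bs4 import BeautifulSoup

html = '<div class="book row"><h2>Learning Deep Architectures for AI</h2></div>'
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2")

# Converting a Tag to a string (which print does) yields its full markup.
as_markup = str(h2)
# get_text() returns only the text content; strip=True trims whitespace.
as_text = h2.get_text(strip=True)
# The h2 has no class attribute, so a class filter matches nothing.
no_match = soup.find_all("h2", class_="book-container")

print(as_markup)
print(as_text)
print(no_match)
```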
