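html - Python: extract only the first href link of every nth occurrence in a for loop
Problem description
I am trying simple web scraping with Python, but I have a problem getting the link names: each entry has 2 to 3 links (href values) under the same class btn, as shown below,
whereas I only need to print the first link for each new entry in the loop.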
#!/usr/bin/python3
from bs4 import BeautifulSoup
import requests
url = "https://www.learndatasci.com/free-data-science-books/"
# Getting the webpage, creating a Response object.
response = requests.get(url)
# Extracting the source code of the page.
data = response.text
# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')
# Extracting all the <a> tags into a list.
tags = soup.find_all('a', class_='btn')
# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))
Output of the code above:
http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
http://www.amazon.com/gp/product/0136042597/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=0136042597&linkCode=as2&tag=learnds-20&linkId=3FRORB7P56CEWSK5
http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
http://amzn.to/1WePh0N
http://www.e-booksdirectory.com/details.php?ebook=9575
http://amzn.to/1FcalRp
Desired output:
http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
http://www.e-booksdirectory.com/details.php?ebook=9575
Solution
BeautifulSoup has excellent CSS selector support; simply use it to pick every odd item:
soup = BeautifulSoup(data, 'lxml')
for tag in soup.select('a.btn:nth-of-type(odd)'):
    print(tag['href'])
Demo:
>>> for tag in soup.select('a.btn:nth-of-type(odd)'): print(tag['href'])
...
http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
http://www.e-booksdirectory.com/details.php?ebook=9575
... etc
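To make the behaviour of the odd selector concrete, here is a minimal sketch on hand-written markup. The HTML, the example.com URLs, and the name SAMPLE_HTML are assumptions for illustration and are not taken from the real page; the built-in html.parser is swapped in for lxml only to avoid the extra dependency. Note that :nth-of-type counts an element's position among sibling elements with the same tag name, so a book with three links also yields its third link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the question: each book sits in
# its own container and carries two or three a.btn links.
SAMPLE_HTML = """
<div class="book">
  <a class="btn" href="http://example.com/book-1.pdf">View Free Book</a>
  <a class="btn" href="http://example.com/book-1-reviews">See Reviews</a>
</div>
<div class="book">
  <a class="btn" href="http://example.com/book-2.pdf">View Free Book</a>
  <a class="btn" href="http://example.com/book-2-reviews">See Reviews</a>
  <a class="btn" href="http://example.com/book-2-extra">Extra</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Position is counted among sibling <a> elements inside each parent,
# so the 1st and 3rd link of every container match.
odd_links = [tag["href"] for tag in soup.select("a.btn:nth-of-type(odd)")]

for href in odd_links:
    print(href)
```

With exactly two links per book the odd selector returns one link per book; with three it returns two, so it is only a fit when the link count is fixed at two.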
Each group of links has a parent element, <div class="book">, that you can use:
for tag in soup.select('.book a.btn:first-of-type'):
    print(tag['href'])
This works for any number of links per book.
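The per-book selection can be sketched on sample markup as follows. The HTML, the example.com URLs, and the variable names are hypothetical; html.parser is used instead of lxml only to keep the sketch dependency-light. The second loop is an alternative that is not part of the answer above: it asks each container for its first a.btn with select_one, which may be more forgiving if a container also holds other <a> elements before the buttons.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: books with two, three, and one a.btn link.
SAMPLE_HTML = """
<div class="book">
  <a class="btn" href="http://example.com/a.pdf">View Free Book</a>
  <a class="btn" href="http://example.com/a-reviews">See Reviews</a>
</div>
<div class="book">
  <a class="btn" href="http://example.com/b.pdf">View Free Book</a>
  <a class="btn" href="http://example.com/b-reviews">See Reviews</a>
  <a class="btn" href="http://example.com/b-extra">Extra</a>
</div>
<div class="book">
  <a class="btn" href="http://example.com/c.pdf">View Free Book</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector from the answer: the first <a> in each .book, if it has class btn.
first_links = [tag["href"] for tag in soup.select(".book a.btn:first-of-type")]

# Alternative sketch: iterate over containers and take the first a.btn in each.
per_book_links = []
for book in soup.select(".book"):
    link = book.select_one("a.btn")
    if link is not None:  # skip containers without any button link
        per_book_links.append(link["href"])

for href in first_links:
    print(href)
```

On this sample both approaches give one link per book regardless of how many links the book has.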