python - Python Beautiful Soup: get only the first href
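Question
I am trying to scrape URLs from the href attributes on a web page. Below is a snippet showing what one list item looks like inside the div I am scraping.
How can I narrow down the code below so that it grabs only the first href in the HTML?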
# Import the modules this snippet uses
import bs4 as bs
import re
html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";"></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
soup = bs.BeautifulSoup(html,'lxml')
# Prints the href of every <a> tag whose URL starts with http://
for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    temp = link.get('href')
    print(temp)
Solution
You can use find, which returns only the first matching tag:
from bs4 import BeautifulSoup as soup
html ='<li><span class="num">20</span><span class="tmb tmb-xs tmb-artist-xs"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html"<img alt="The Sound Of Music - Do-Re-Mi lyrics" title="Do-Re-Mi" pagespeed_url_hash="552365003" src="http://img2-ak.lst.fm/i/u/174s/cf8387bbdbfc42ce82844a1cdfec9a33.png"></a></span><span class="song hasvid"><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html#startvideo" class="vid";"></a><a href="http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html" class="song-link hasvidtoplyric">Do-Re-Mi Lyrics </a><span class="artist"><a href="http://www.metrolyrics.com/the-sound-of-music-lyrics.html" class="subtitle" title="The Sound Of Music">The Sound Of Music </a></span></span><div class="last-week up">#21</div></li>'
result = soup(html, 'lxml').find('a')['href']
Output:
'http://www.metrolyrics.com/doremi-maria-and-the-children-lyrics-the-sound-of-music.html'
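Note that find('a')['href'] raises an error when the page has no <a> tag, or when the first <a> has no href attribute. If you also want to keep the "^http://" filter from the question, a more defensive version might look like the sketch below. The helper name first_href, the sample markup, and the example.com URLs are made up for illustration, and the stdlib 'html.parser' is used here only so the snippet does not depend on lxml being installed.

```python
import re
from bs4 import BeautifulSoup


def first_href(html, pattern=r"^http://"):
    """Return the href of the first <a> whose href matches pattern, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # find() stops at the first match; passing a compiled regex as the
    # href filter skips <a> tags with no href or a non-matching href.
    link = soup.find("a", href=re.compile(pattern))
    return link["href"] if link is not None else None


# Hypothetical sample markup, not taken from the page in the question.
sample = (
    '<ul>'
    '<li><a name="top">no href here</a></li>'
    '<li><a href="/relative/path.html">relative</a></li>'
    '<li><a href="http://example.com/first.html">first</a></li>'
    '<li><a href="http://example.com/second.html">second</a></li>'
    '</ul>'
)

print(first_href(sample))             # http://example.com/first.html
print(first_href("<p>no links</p>"))  # None
```

A CSS selector should give the same result if you prefer that style: soup.select_one('a[href^="http://"]') also returns the first matching tag, or None when nothing matches.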