Scraping URLs with Beautiful Soup but not getting all the links

Problem description

Here is the relevant part of the HTML being scraped:

<div class="blockSpoiler-content">
   <div class="contentSpoiler">
      <div class="link-box" id="62H" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url1.net.html" target="_blank">Link1</a>
      </div>
      <div class="link-box" id="IFA" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url2.net.html" target="_blank">Link2</a>
      </div>
      <div class="link-box" id="ruG" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url3.com.html" target="_blank">Link3</a>
      </div>
      <div class="link-box" id="Bdf" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url4.com" target="_blank">Link4</a>
      </div>
      <div class="link-box" id="1Da" style="background-color: rgb(65, 120, 50);">
         <div class="status-box"><i class="working" title="Working"></i></div>
         <a rel="external" href="https://url5.net.html" target="_blank">Link5</a>
      </div>
   </div>
</div>

These are the URLs I am trying to get:

  1. https://url1.net.html
  2. https://url2.net.html
  3. https://url3.com.html
  4. https://url4.com
  5. https://url5.net.html

I have tried different things but only got this far (the local file is just for testing purposes, before doing the actual web scraping):

from bs4 import BeautifulSoup

with open("mainLocalFile.html") as fp:
    soup2 = BeautifulSoup(fp, 'html.parser')

links = soup2.find_all('div', class_='blockSpoiler-content')
# print(links)
for link in links:
    print(link)
    print(link.a)          # prints only the first <a> tag
    print(link.a['href'])  # prints only the first URL

Tags: python, web-scraping, beautifulsoup

Solution

Select all <a> tags under the class blockSpoiler-content (at the moment, with .find_all you are only selecting the single <div class="blockSpoiler-content">):

for a in soup2.select(".blockSpoiler-content a"):
    print(a["href"])

This prints:

https://url1.net.html
https://url2.net.html
https://url3.com.html
https://url4.com
https://url5.net.html
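
For reference, the same result can be reached with find_all instead of a CSS selector: link.a is shorthand for link.find('a'), which returns only the first match, whereas link.find_all('a') returns every anchor inside the div. A minimal sketch of that approach, reusing the mainLocalFile.html test file and soup2 object from the question:

from bs4 import BeautifulSoup

with open("mainLocalFile.html") as fp:
    soup2 = BeautifulSoup(fp, 'html.parser')

# find_all-based equivalent of the CSS selector above:
# collect every <a> inside each div.blockSpoiler-content
for block in soup2.find_all('div', class_='blockSpoiler-content'):
    for a in block.find_all('a'):
        print(a['href'])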
