Extracting specific text from repeated tags with BeautifulSoup

Question

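I'm working on a digital humanities project and trying to separate the image descriptions from a series of digitized engravings. (I'm also fairly new to coding and programming in general, as I'm just a humble philosopher wading into the waters of DH.) So far I've been able to isolate the page source using Python and a urllib script that looks like this: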

import urllib.request
import urllib.parse


url = "http://pitts.emory.edu/dia/image_details.cfm?ID=17250"
f = urllib.request.urlopen(url)
print(f.read().decode('utf-8'))

My problem, however, lies in the source itself. The description sits alongside other information, all of it broken up by P and b tags:

</div>
    <div class="col-sm-6">                                                
    <P>
      <b>Book Title:</b>
      <A HREF="book_detail.cfm?ID=2449">The Holy Bible containing the Old and New Testaments, according to the authorised version. With illustrations by Gustave Doré</a>
    </p>              
    <P>
        <b>Author:</b> Doré, Gustave, 1832-1883
    </p>
    
    <P>
        <b>Image Title:</b> Baptism of Jesus
    </p>
    <P>
      <b>Scripture Reference:</b><ul><li>John 1 (<a href='search.cfm?biblicalbook=John&biblicalbookchapter=1'>further images</a> / <a rel='shadowbox;height=500;width=600' href='http://www.commonenglishbible.com/explore/passage-lookup/?query=John+1'>scripture text</a>)</li></ul>
    </p>
    <P>
        <b>Description:</b> John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
    </P>
    <P>
        <A HREF="book_list.cfm?ID=2449">Click here
        </a> for additional images available from this book.
    </P>
    <p>For information on licensing this image, please send an email, including a link to the image, to 
        <a href="mailto:dia@emory.edu?subject=Licensing%20Image%20From%20DIA - 17250">dia@emory.edu</a>
    </p>
</div>

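How can I use BeautifulSoup to isolate the description text from these tags? Everything I've found on Stack Overflow so far suggests it should be possible, but I haven't found anything that tries to do specifically this.

Again, from the source, I only want to extract the description "John the Baptist baptizes Jesus...". How can I do that?

Thanks! And again, apologies for my lack of solid knowledge.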

Tags: python, web-scraping, beautifulsoup

Answer


With the following code I can get almost exactly what you want:

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

url = "http://pitts.emory.edu/dia/image_details.cfm?ID=17250"
f = urllib.request.urlopen(url)

soup = BeautifulSoup(f, 'html.parser')
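# Find the <b>Description:</b> label and the <P> that contains it
# (string= is the current name for the older text= argument)
label = soup.find("b", string="Description:")
parent = label.parent
label.decompose()  # drop the label so only the description remains
print(parent.text)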

I added BeautifulSoup and removed the "Description:" label from its parent tag, so only the description text is printed.
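Since the page repeats the same label-plus-text pattern for every field, the same idea can be wrapped in a small helper. The following is only a sketch under a few assumptions: `extract_field` and `SAMPLE_HTML` are hypothetical names introduced for illustration, the markup is a trimmed copy of the snippet from the question (with the description shortened) so that it runs without a network request, and the text is read from the label's following siblings instead of removing the label from the tree.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed copy of the markup from the question (description shortened).
SAMPLE_HTML = """
<div class="col-sm-6">
    <P>
        <b>Author:</b> Doré, Gustave, 1832-1883
    </p>
    <P>
        <b>Image Title:</b> Baptism of Jesus
    </p>
    <P>
        <b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.
    </P>
</div>
"""


def extract_field(html, label):
    """Return the text following <b>label</b> in its parent tag, or None.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    bold = soup.find("b", string=label)
    if bold is None:
        return None
    # Walk everything after the label; nested tags contribute their text.
    parts = []
    for node in bold.next_siblings:
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(parts).split())


print(extract_field(SAMPLE_HTML, "Description:"))
print(extract_field(SAMPLE_HTML, "Image Title:"))
```

Against the live page you would pass the downloaded HTML instead of `SAMPLE_HTML`. Reading the siblings leaves the parsed tree untouched, which may be convenient if you want to pull several fields from the same page.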

