首页 > 解决方案 > 解析html元素

问题描述

在我(下载的)HTML 中,我在提到的每个文件的顶部都有主管(例如下面代码中的 Dror Ben Asher):

<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>

沿着 html,这些高管的名字会重复出现多次,在名字后面跟着我要解析的文本元素示例

<P>
<STRONG> Dror Ben Asher </STRONG>
</P>
<P>Yeah, in terms of production in first quarter, we’re going to be lower than we had forecasted mainly due to our grade.  We’ve had a couple of higher grade stopes in our Seabee complex that we’ve had some significant problems in terms of ground failures and dilution effects.  In addition, not helping out, we’ve had some equipment downtime on some of our smaller silt development, so the combination of those two issues are affecting us.
</p>

现在我有一个代码(见下文),它标识了一位执行官“Dror Ben Asher”,并掌握了 P 元素中出现的所有文本。但我希望这适用于所有高管以及提到不同高管(不同公司)的多个 html 文件。

import textwrap
import os
from bs4 import BeautifulSoup

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)
for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
    txt = answer.get_text(strip=True)

    s = answer.find_next_sibling()
    while s:
        if s.name == 'strong' or s.find('strong'):
            break
        if s.name == 'p':
            txt += ' ' + s.get_text(strip=True)
        s = s.find_next_sibling()

    txt = ('\n' + ' '*31).join(textwrap.wrap(txt))

    print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=open("output.txt", "a")

有没有人有解决这个挑战的建议?

标签: pythonhtmlbeautifulsouphtml-parsing

解决方案


如果我正确理解您的问题,您可以将代码放在一个函数中,您可以将所需的名称作为参数传递给该函数,并使用该变量来构造您的搜索字符串。例如:

def func(name_to_find):
# some code
    for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("{n}") + p'.format(n=name_to_find)):
# some other code

并这样称呼它:

func('Dror Ben Asher')

推荐阅读