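python - Parsing HTML elements
Problem description
In my (downloaded) HTML files, the executives are listed at the top of each file (for example, Dror Ben Asher in the code below):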
<DIV id=article_participants class="content_part hid">
<P>Redhill Biopharma Ltd. (NASDAQ:<A title="" href="http://seekingalpha.com/symbol/rdhl" symbolSlug="RDHL">RDHL</A>)</P>
<P>Q4 2014 <SPAN class=transcript-search-span style="BACKGROUND-COLOR: yellow">Earnings</SPAN> Conference <SPAN class=transcript-search-span style="BACKGROUND-COLOR: #f38686">Call</SPAN></P>
<P>February 26, 2015 9:00 AM ET</P>
<P><STRONG>Executives</STRONG></P>
<P>Dror Ben Asher - CEO</P>
<P>Ori Shilo - Deputy CEO, Finance and Operations</P>
<P>Guy Goldberg - Chief Business Officer</P>
Further down in the HTML, these executives' names recur many times, and each name is followed by the text I want to parse. For example:
<P>
<STRONG> Dror Ben Asher </STRONG>
</P>
<P>Yeah, in terms of production in first quarter, we’re going to be lower than we had forecasted mainly due to our grade. We’ve had a couple of higher grade stopes in our Seabee complex that we’ve had some significant problems in terms of ground failures and dilution effects. In addition, not helping out, we’ve had some equipment downtime on some of our smaller silt development, so the combination of those two issues are affecting us.
</p>
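I currently have code (see below) that identifies one executive, "Dror Ben Asher", and collects all the text in the P elements that follow his name. However, I would like this to work for all executives, and across multiple HTML files that mention different executives (from different companies).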
import textwrap
import os
from bs4 import BeautifulSoup

directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        print('{:<30} {:<70}'.format('Name', 'Answer'))
        print('-' * 101)
        # Select the first paragraph after each occurrence of the executive's name
        for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
            txt = answer.get_text(strip=True)
            s = answer.find_next_sibling()
            # Append the following paragraphs until the next speaker (a <strong> element)
            while s:
                if s.name == 'strong' or s.find('strong'):
                    break
                if s.name == 'p':
                    txt += ' ' + s.get_text(strip=True)
                s = s.find_next_sibling()
            txt = ('\n' + ' '*31).join(textwrap.wrap(txt))
            print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt), file=open("output.txt", "a"))
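Does anyone have a suggestion for solving this challenge?
Solution
If I understand your question correctly, you can put your code in a function, pass the desired name to that function as an argument, and use that variable to construct your search string. For example: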
def func(name_to_find):
    # some code
    for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("{n}") + p'.format(n=name_to_find)):
        # some other code
and call it like this:
func('Dror Ben Asher')
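To cover every executive rather than a single hard-coded name, the names could be taken from the "Name - Title" lines listed under Executives and fed into the same selector one at a time. The following is only a minimal sketch under some assumptions: it presumes the text of those paragraphs has already been extracted into a list of strings (for example with get_text on the relevant P elements), and the helper names parse_executives and build_selector are hypothetical, not part of the original code or of BeautifulSoup.

```python
def parse_executives(paragraph_texts):
    """Return (name, title) pairs from lines such as 'Dror Ben Asher - CEO'."""
    executives = []
    for text in paragraph_texts:
        # Split on the first ' - ' only, so commas or dashes in the title survive
        name, sep, title = text.partition(' - ')
        if sep:  # skip lines without the 'Name - Title' separator
            executives.append((name.strip(), title.strip()))
    return executives


def build_selector(name):
    """Build the CSS selector used in the question for one executive name."""
    return ('p:contains("Question-and-Answer Session") ~ '
            'strong:contains("{n}") + p').format(n=name)


# Sample lines copied from the HTML excerpt in the question
lines = ['Dror Ben Asher - CEO',
         'Ori Shilo - Deputy CEO, Finance and Operations',
         'Guy Goldberg - Chief Business Officer']

executives = parse_executives(lines)
# One selector per executive, ready to be passed to soup.select(...)
selectors = {name: build_selector(name) for name, _ in executives}
```

Each value in selectors can then be passed to soup.select in place of the hard-coded string, and the matching title from executives can be used for the first column of the output. Because the list is rebuilt for every file, the same loop should work for files from different companies, provided they follow the same layout.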