XPath not working as expected

Problem description

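I have some very messy HTML and want to extract the paragraph text without the markup, but I find that I can only get the first paragraph. For example, the HTML looks like this: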

   <p>BLAH BLAH<strong><nobr><strong>people</strong></nobr></strong>&#39;s work <strong>&quot;Blah <nobr><strong><span style="font-size:14pt"><strong>blah</strong></span></strong></nobr> and <nobr><strong><span style="font-size:14pt"><strong>Nothing</strong></span></strong></nobr> quote&quot;</strong>lalal</p>

<p>More text<strong><nobr><strong>More text</strong></nobr></strong> blah blah</p>

I am trying:

converted = html.fromstring(body)
para = converted.xpath('//*[starts-with(name(), "p")]')

and then looping over the paragraphs:

string_content = ''
for p in para:          
    if p.text is not None:
        string_content += ' ' + p.text

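But I only get one <p> element, the first one. This code does not seem to pick up everything I need, and it usually returns only the first piece of information.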

Tags: python, xpath

Solution


If you want everything inside the p tags, you can do the following:

from lxml import html

body = '<p>BLAH BLAH<strong><nobr><strong>people</strong></nobr></strong>&#39;s work <strong>&quot;Blah <nobr><strong><span style="font-size:14pt"><strong>blah</strong></span></strong></nobr> and <nobr><strong><span style="font-size:14pt"><strong>Nothing</strong></span></strong></nobr> quote&quot;</strong>lalal</p><p>More text<strong><nobr><strong>More text</strong></nobr></strong> blah blah</p>'

converted = html.fromstring(body)
para = converted.xpath('//p')

# text_content() returns the text of the element and all of its descendants
content = [p.text_content() for p in para if p.text_content()]
content = ' '.join(content)
print(content)

The result is:

BLAH BLAHpeople's work "Blah blah and Nothing quote"lalal More textMore text blah blah
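
The likely reason the loop in the question returns so little is that, in lxml, `p.text` holds only the text that comes before the element's first child. Text that follows a child element is stored on that child's `.tail`. The sketch below illustrates this on a simplified version of the second paragraph (the inner `<nobr>` and nested `<strong>` tags are dropped for brevity, which is an assumption made for this example). The `spaced` variable is a hypothetical name introduced here to show one way of separating the pieces with spaces, since `text_content()` joins them directly.

```python
from lxml import html

# Simplified version of the second paragraph from the question
# (assumption: nested <nobr>/<strong> wrappers removed for brevity).
snippet = '<p>More text<strong>More text</strong> blah blah</p>'
p = html.fromstring(snippet)

# .text only covers the text before the first child element.
print(repr(p.text))            # text before <strong>

# Text after a child element lives on that child's .tail.
strong = p[0]
print(repr(strong.text))       # text inside <strong>
print(repr(strong.tail))       # text after </strong>

# text_content() concatenates all of the pieces without a separator.
print(repr(p.text_content()))

# itertext() yields each piece separately, so they can be joined with spaces.
spaced = ' '.join(t.strip() for t in p.itertext() if t.strip())
print(repr(spaced))
```

Note also that the expression `//*[starts-with(name(), "p")]` from the question matches any element whose name begins with "p" (for example `pre`), whereas `//p` selects only paragraph elements.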
