How to get text from HTML element by using lxml.html

Problem description

I've been trying to get the full text inside a <div> element on the web page https://www.list-org.com/company/11665809.
The element should contain the substring "Арбитраж".
And it does, because my code

for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)

returns

<Element div at 0x15480d93ac8>

But when I try to get the full text itself using div.text, it returns None.
That seems like a strange result to me. What should I do?
Any help would be greatly appreciated, as would advice on a source for learning the basics of HTML (I'm not a savvy programmer) so I can avoid such easy questions in the future.

Tags: python, html, lxml, lxml.html

Solution


This is one of those odd things that happen when xpath is handled by a host language and library. When you use the xpath expression

 .//div[contains(text(), "Арбитраж")] 

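the search is performed according to xpath rules, which consider the target text to be contained in the target div. When you go on to the next line: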

print(div.text)

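you are using lxml.html, which does not regard the target text as part of the div's text, because it is preceded by an <i> tag: in lxml, an element's .text holds only the text before its first child, and text that follows a child is stored as that child's .tail. To get the text with lxml.html, you have to use: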

print(div.text_content())

or use xpath only:

print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])
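
To make the difference concrete, here is a minimal sketch that runs without fetching the page. The markup in it is hypothetical (the real page's structure is not shown in the question); it only imitates a <div> whose text follows an <i> child, and the names snippet, tree and icon are made up for illustration.

```python
from lxml import html

# Hypothetical markup, NOT the real page: a <div> that starts with an <i>
# child, followed by the text we actually want.
snippet = (
    '<div id="wrap">'
    '<div class="card"><i class="icon"></i>Арбитраж: 3 дела</div>'
    '</div>'
)
tree = html.fromstring(snippet)

# Same xpath as in the question: it matches because the div has a text node
# containing the substring, even though that node comes after <i>.
div = tree.xpath('.//div[contains(text(), "Арбитраж")]')[0]
icon = div[0]  # the <i> child

print(div.text)             # None: no text precedes the first child
print(icon.tail)            # the text after </i> is stored as the child's tail
print(div.text_content())   # all text inside the div, children included
print(div.xpath('text()'))  # list of text nodes directly inside the div
```

Here div.text is None while icon.tail, div.text_content() and the xpath text() step all reach the same string, which is why the two suggestions above work. Note that text_content() also includes text from nested elements, so on richer markup it can return more than the xpath text() step does.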

It seems that lxml.etree and beautifulsoup take different approaches. See this interesting discussion here.
