Get a CSS selector or XPath from textContent

Question

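As far as I know, Python libraries such as BeautifulSoup or scrapy can return the text content for a given CSS selector or XPath. I am looking for the reverse: I want to supply the text that needs to be scraped and get back a CSS selector or XPath that retrieves that text.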

Can this be done with existing libraries?

html = """
<h1 class="some-class">Article title</h1>
<div class="article-text">
  <p class="article-paragraph">Article paragraph text 1.</p>
  <p class="article-paragraph">Article paragraph text 2.</p>
</div>
"""

# ... some magic here with get_selector_by_text_content()
article_title_selector = get_selector_by_text_content("Article title", html) # 'h1.some-class'
article_body_selector = get_selector_by_text_content("Article paragraph text 1. \nArticle paragraph text 2.", html) # 'div.article-text > p'

Tags: python, web-scraping, beautifulsoup, scrapy

Answer


If you can use lxml, you can get the XPath of each element whose text matches the text you supply:

import lxml.html
from lxml import etree

# Texts to locate; `html` is the string defined in the question.
targets = ['Article title', 'Article paragraph text 1.', 'Article paragraph text 2.']

root = lxml.html.fromstring(html)
tree = etree.ElementTree(root)
for e in root.iter():
    # e.text is the text directly inside the element, before its first child
    if e.text in targets:
        print(tree.getpath(e))

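Output (the leading /div appears because lxml wraps a fragment with several top-level elements in a single div):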

/div/h1
/div/div/p[1]
/div/div/p[2]
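
The answer above yields XPaths only. For the CSS-selector half of the question, the following is a minimal sketch that uses only the standard library's html.parser rather than lxml. It reuses the question's hypothetical name get_selector_by_text_content, but returns a list of selectors, one per element whose own text exactly equals the target. The helper class _SelectorFinder and the VOID_TAGS set are assumptions made for illustration. It does not handle text spread over several elements, such as the two-paragraph example in the question.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _SelectorFinder(HTMLParser):
    """Hypothetical helper: tracks open tags and records a selector on a text match."""

    def __init__(self, target):
        super().__init__()
        self.target = target.strip()
        self.stack = []    # (tag, simple selector) for each currently open element
        self.matches = []  # selectors of elements whose text equals the target

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, ".".join([tag] + classes)))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and data.strip() == self.target:
            self.matches.append(" > ".join(sel for _, sel in self.stack))


def get_selector_by_text_content(text, html_source):
    """Return CSS selectors (child-combinator chains) for elements containing `text`."""
    finder = _SelectorFinder(text)
    finder.feed(html_source)
    finder.close()
    return finder.matches


sample_html = """
<h1 class="some-class">Article title</h1>
<div class="article-text">
  <p class="article-paragraph">Article paragraph text 1.</p>
  <p class="article-paragraph">Article paragraph text 2.</p>
</div>
"""

print(get_selector_by_text_content("Article title", sample_html))
# ['h1.some-class']
print(get_selector_by_text_content("Article paragraph text 1.", sample_html))
# ['div.article-text > p.article-paragraph']
```

Both paragraphs map to the same selector, which is close to the 'div.article-text > p' the question asks for. A selector built this way is not guaranteed to be unique on a real page, so it is worth running it back against the document to check what it matches.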
