BeautifulSoup with Recursion: get the HTML tag with the most children / longest path

Question

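I am trying to get the HTML tag with the most children, that is, the tag at the end of the longest path from the root.

Example HTML: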

<html>
    <head>
        <meta></meta>
        <script></script>
    </head>
    <body>
        <div>
            <p></p>
        </div>
        <div>
            <p>
                <span> Longest Path </span>
            </p>
        </div>
    </body>
</html>

What I want to get is html > body > div > p > span

Right now I am trying to collect all the paths using recursion and bs4:

from bs4 import BeautifulSoup

HTML = """
<html>
    <head>
        <meta></meta>
        <script></script>
    </head>
    <body>
        <div>
            <p></p>
        </div>
        <div>
            <p>
                <span> Longest Path </span>
            </p>
        </div>
    </body>
</html>
"""


def longest_path():
    """ Function that will return the longest path in the HTML """
    soup = BeautifulSoup(HTML, "html.parser")
    tags = soup.find_all(recursive=False)
    paths = []
    for tag in tags:
        path = []
        full_path = _recursive_path(tag, path)
        paths.append(full_path)
    return paths


def _recursive_path(tag, path):
    """ Function that uses recursion to calculate path """
    path.append(tag.name)
    tag_children = tag.find_all(recursive=False)
    if not tag_children:
        return path

    for tag_child in tag_children:
        _recursive_path(tag_child, path)


print(longest_path())

But so far this does not produce the result I want. Any ideas?

Tags: python, recursion, beautifulsoup

Answer


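You can use recursion with a generator. The HTML can be traversed through each tag's contents attribute, iterating over the child tags at every level and incrementing a depth counter: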

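from bs4 import BeautifulSoup as soup, NavigableString as ns

def get_paths(d, p=(), c=0):
    # Keep child tags only; text nodes are skipped. The := operator needs Python 3.8+.
    if not (k := [i for i in getattr(d, 'contents', []) if not isinstance(i, ns)]):
        # Leaf tag: yield its depth together with the full path from the root.
        yield (c, ' > '.join((*p, d.name)))
    else:
        for i in k:
            yield from get_paths(i, p=(*p, d.name), c=c + 1)

# HTML is the string defined in the question.
_, path = max(get_paths(soup(HTML, 'html.parser').html), key=lambda x: x[0])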

Output:

'html > body > div > p > span'
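
As for why the code in the question falls short: `_recursive_path` only returns `path` when it reaches a leaf, so for any tag that has children it returns `None`, and every branch appends to the same shared list. Below is a minimal sketch of how that approach could be adapted, assuming the same HTML string. The helper name `all_paths` and the `html` parameter are my own choices for illustration, not part of the original post.

```python
from bs4 import BeautifulSoup

HTML = """
<html>
    <head>
        <meta></meta>
        <script></script>
    </head>
    <body>
        <div>
            <p></p>
        </div>
        <div>
            <p>
                <span> Longest Path </span>
            </p>
        </div>
    </body>
</html>
"""


def all_paths(tag, prefix=()):
    """Return one root-to-leaf path (a tuple of tag names) per leaf tag under `tag`."""
    path = prefix + (tag.name,)
    # find_all(recursive=False) returns only direct child tags, never text nodes.
    children = tag.find_all(recursive=False)
    if not children:
        return [path]
    paths = []
    for child in children:
        # Each branch gets its own immutable copy of the path so far.
        paths.extend(all_paths(child, path))
    return paths


def longest_path(html):
    """Return the deepest root-to-leaf path as a ' > '-joined string."""
    soup = BeautifulSoup(html, "html.parser")
    paths = []
    for top in soup.find_all(recursive=False):
        paths.extend(all_paths(top))
    # max() keeps the first path of maximal length when several tie.
    return " > ".join(max(paths, key=len))


print(longest_path(HTML))
```

Building a new tuple at each level, instead of appending to one list, is what keeps sibling branches from leaking into each other's paths. Note that `max` raises `ValueError` if the document contains no tags at all, so real input may need a guard for that case.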
