How to iterate over a list of strings in Python and join the strings that belong to a tag?

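Problem description

In Python 3, when iterating over a list of elements, how can I "isolate" the content between the elements of interest?

I have a list: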

list = ["<h1> question 1", "question 1 content", "question 1 more content", "<h1> answer 1", "answer 1 content", "answer 1 more content", "<h1> question 2", "question 2 content", "<h> answer 2", "answer 2 content"]

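In this list, some elements carry a header tag (<h1> or <h>) and others carry no tag. The idea is that an element with a tag is a "header", and the elements that follow it, up to the next tag, are its content.

How can I join the list elements that belong to each header so that I end up with two lists of equal size: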

headers = ["<h1> question 1", "<h1> answer 1", "<h1> question 2", "<h> answer 2"]
content = ["question 1 content question 1 more content", "answer 1 content answer 1 more content", "question 2 content", "answer 2 content"]

Both lists have the same length; in this case, each has 4 elements.

I managed to separate the parts, but I could use some help finishing:

list = ["<h1> question 1", "question 1 content", "question 1 more content", "<h1> answer 1", "answer 1 content", "answer 1 more content", "<h1> question 2", "question 2 content", "<h> answer 2", "answer 2 content"]

headers = []
content = []

for i in list:
    if "<h1>" in i:
        headers.append(i)

    if "<h1>" not in i:
        tempContent = []
        tempContent.append(i)
        content.append(tempContent)

Any ideas on how to combine these texts so that they correspond one to one?

Thanks!

Tags: python, python-3.x, string, list

Solution


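Assuming that every element after a header is content of that header, and that the first element is always a header, you can use itertools.groupby.

The key can be whether the element contains a header tag, so that each header's content is grouped together right after it: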

from itertools import groupby

lst = ["<h1> question 1", "question 1 content", "question 1 more content", "<h1> answer 1", "answer 1 content", "answer 1 more content", "<h1> question 2", "question 2 content", "<h> answer 2", "answer 2 content"]

headers = []
content = []

for key, values in groupby(lst, key=lambda x: "<h" in x):
    if key:
        headers.append(*values)
    else:
        content.append(" ".join(values))

print(headers)
print(content)

This gives:

['<h1> question 1', '<h1> answer 1', '<h1> question 2', '<h> answer 2']
['question 1 content question 1 more content', 'answer 1 content answer 1 more content', 'question 2 content', 'answer 2 content']
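To see why this works, it can help to look at the runs that groupby produces. The following is a minimal sketch on a shortened, made-up list (sample is a hypothetical name, not taken from the question). groupby only merges consecutive elements that share the same key, so each header and each stretch of content becomes its own group:

```python
from itertools import groupby

# Hypothetical shortened input, for illustration only.
sample = ["<h1> q1", "q1 a", "q1 b", "<h1> a1", "a1 a"]

# groupby yields lazy iterators, so materialise each group into a list.
groups = [
    (key, list(values))
    for key, values in groupby(sample, key=lambda x: "<h" in x)
]

for key, values in groups:
    print(key, values)

# True ['<h1> q1']
# False ['q1 a', 'q1 b']
# True ['<h1> a1']
# False ['a1 a']
```

Note that two headers in a row would land in the same True group, which is why the approach relies on every header being followed by at least one content element.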

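The problem with your current approach is that you always add just one item to content. It also tests for "<h1>", which misses the "<h> answer 2" header. What you want is to accumulate a temp_content list until you hit the next header, and only then append it and reset: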

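headers = []
content = []
temp_content = None

# lst is the list defined in the groupby example above
for i in lst:
    if "<h" in i:
        # A new header closes the previous header's content, if there is one
        if temp_content is not None:
            content.append(" ".join(temp_content))
        temp_content = []
        headers.append(i)
    else:
        temp_content.append(i)

# Add the content of the last header, which no later header closes
if temp_content is not None:
    content.append(" ".join(temp_content))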
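
If a header might be followed directly by another header, or the list might start with content, the two lists can drift out of step. Below is a minimal sketch of one way to guard against that. It assumes an empty string is an acceptable placeholder for a header without content, and that content appearing before the first header can be dropped. split_sections and its marker parameter are hypothetical names, not part of the original answer:

```python
def split_sections(items, marker="<h"):
    """Return (headers, content) as two lists of equal length."""
    headers = []
    parts = []  # one list of content strings per header

    for item in items:
        if marker in item:
            # Start a new section for every header, even an empty one.
            headers.append(item)
            parts.append([])
        elif parts:
            parts[-1].append(item)
        # Otherwise: content before the first header is dropped (assumption).

    content = [" ".join(p) for p in parts]
    return headers, content


lst = ["<h1> question 1", "question 1 content", "question 1 more content",
       "<h1> answer 1", "answer 1 content", "answer 1 more content",
       "<h1> question 2", "question 2 content", "<h> answer 2", "answer 2 content"]

headers, content = split_sections(lst)

# Pair each header with its content.
for header, text in zip(headers, content):
    print(header, "->", text)
```

Because every header opens its own section, len(headers) == len(content) holds regardless of how the input is arranged.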
