How do I extract column names from a multi-line string (or list)?

Problem description

I have the following sample string:

column_names = """
    ================================================================================
                                             total                   total     final
    store.                        toys       output   person 1/     usage 5/   stock
    ================================================================================
"""

I can break it down line by line like this:

column_lines = [
'    ================================================================================',
'                                             total                   total     final',
'    store.                        toys       output   person 1/     usage 5/   stock',
'    ================================================================================',
]

Without knowing the text in the string beforehand, I want a way to end up with the following list:

['store', 'toys', 'total output', 'person', 'total usage', 'final stock']

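I am struggling to find any approach to this problem.

What are the different ways to approach this? How can I extract strings from multi-line text without knowing in advance which column names to expect?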

Tags: python, regex, string, parsing

Solution


Basic working solution

Here is a working solution. We need to specify which two lines should be "combined".

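import re

def find_group(l1, l2):
    """Prefix each word of l2 with the word of l1 that overlaps it, if any."""

    def intersect(x1, x2):
        # x1 and x2 are (start, end) column spans. The test is symmetric,
        # so a single comparison covers both orderings.
        return x1[0] <= x2[1] and x1[1] >= x2[0]

    pat = r"[a-zA-Z]+"
    # Column spans of every alphabetic run on each line
    matches1 = [(match.start(0), match.end(0)) for match in re.finditer(pat, l1)]
    matches2 = [(match.start(0), match.end(0)) for match in re.finditer(pat, l2)]

    ret = []
    for g2 in matches2:
        add_g2 = True
        for g1 in matches1:
            if intersect(g1, g2):
                # A word on l1 sits above this word: join them
                ret.append(l1[g1[0]:g1[1]] + " " + l2[g2[0]:g2[1]])
                add_g2 = False
                break
        if add_g2:
            # Nothing above: the l2 word is a column name on its own
            ret.append(l2[g2[0]:g2[1]])

    return ret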

Here is how it handles your example:

find_group(column_lines[1], column_lines[2]) # Needs to define which lines.
# > ['store', 'toys', 'total output', 'person', 'total usage', 'final stock']
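The call above hard-codes which two lines to combine. As a minimal sketch, assuming the header is always delimited by ruler lines made only of `=` characters (true for the sample, but an assumption in general), the lines could be picked out automatically. The helper name `header_lines` is hypothetical and not part of the original answer.

```python
import re

def header_lines(text):
    """Return the non-blank lines between the first two ruler lines.

    Assumption: a ruler is a line consisting only of '=' characters,
    ignoring surrounding whitespace. Returns [] if fewer than two exist.
    """
    lines = text.splitlines()
    rulers = [i for i, line in enumerate(lines)
              if re.fullmatch(r"\s*=+\s*", line)]
    if len(rulers) < 2:
        return []
    first, second = rulers[0], rulers[1]
    # Keep only lines that contain something other than whitespace
    return [line for line in lines[first + 1:second] if line.strip()]

sample = """
    ====================
                 total
    store.       output
    ====================
"""
top, bottom = header_lines(sample)
```

With a two-line header such as the one in the question, `top` and `bottom` can then be passed straight to `find_group(top, bottom)`.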

General solution

Here is a solution that works for any number of lines.

import re

def find_group(lines):

    if isinstance(lines, str):
        lines = lines.split("\n")

    def intersect(x1, x2):
        """Checks if two couples of x-coordinates intersect."""
        # The test is symmetric, so one comparison covers both orderings.
        return x1[0] <= x2[1] and x1[1] >= x2[0]

    pat = r"[a-zA-Z]+"
    # Coordinates of all parts matching the pattern, per line
    matches = [[(match.start(0), match.end(0)) for match in re.finditer(pat, line)]
               for line in lines]

    # Starts from the matches of line 0, copied as lists so that every group
    # has the same type (mixing tuples and lists would break sorted() below)
    groups = [list(m) for m in matches[0]]
    for i in range(1, len(lines)):
        for g2 in matches[i]:
            add_g2 = True
            for i_g1, g1 in enumerate(groups):
                if intersect(g1, g2):
                    # Merge both lines intersection into the variable groups
                    groups[i_g1] = [min(g1[0], g2[0]), max(g1[1], g2[1])]
                    add_g2 = False
                    break
            if add_g2:
                # If alone in the x-coord, adds the match as a new group
                groups.append([g2[0], g2[1]])
            # "groups" becomes the merge of the first i lines results.

    # Sorts the groups by their first coordinate.
    # Then joins all matches located between each group's coordinates
    listed_groups = [[" ".join(re.findall(pat, line[group[0]: group[1]]))
                      for line in lines]
                     for group in sorted(groups)]

    # Replaces all unnecessary whitespaces and format groups as strings
    return [re.sub(r"\s+", " ", " ".join(g).strip()) for g in listed_groups]

Result:

column_names = """
================================================================================
some                            I                    each                    hopefully
 kind                         love
 of
                                          total                  total     final
store.                        toys       output   person 1/     usage 5/   stock
================================================================================"""

find_group(column_names)
# > ['some kind of store',
#    'I love toys',
#    'total output',
#    'each person',
#    'total usage',
#    'hopefully final stock']
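Both functions rest on one idea: record the column span of every word, then treat words on different lines as parts of the same column name when their spans overlap. The sketch below isolates that test. The names `word_spans` and `overlaps` are hypothetical, and `overlaps` uses a strict half-open comparison, which is slightly stricter than the `intersect` helper in the answer: `intersect` uses `<=` and `>=`, so it also reports spans that merely touch end to start.

```python
import re

def word_spans(line, pat=r"[a-zA-Z]+"):
    """Return (start, end) for each match; end is exclusive, as in Match.end()."""
    return [(m.start(), m.end()) for m in re.finditer(pat, line)]

def overlaps(a, b):
    """True if two half-open spans share at least one column."""
    return a[0] < b[1] and b[0] < a[1]

top = "       total"
bottom = "toys   output"

for span in word_spans(bottom):
    # Words on the top line that sit above this word on the bottom line
    above = [top[s:e] for s, e in word_spans(top) if overlaps((s, e), span)]
    print(" ".join(above + [bottom[span[0]:span[1]]]))
```

Whether touching spans should count as the same column depends on how tightly the table is laid out, so the choice between the strict and the inclusive test is a judgment call for your data.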

Let me know if you need more explanation.

