python - How to extract column names from a multi-line string (or list)?
Problem description
I have the following example string:
column_names = """
 ================================================================================
                       total                    total        final
 store.     toys       output     person 1/     usage 5/     stock
 ================================================================================
"""
I can break it down line by line like this:
column_lines = [
    ' ================================================================================',
    '                       total                    total        final',
    ' store.     toys       output     person 1/     usage 5/     stock',
    ' ================================================================================',
]
Without knowing the text in the string in advance, I want to find a way to end up with the following list:
['store', 'toys', 'total output', 'person', 'total usage', 'final stock']
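I am struggling to find any approach to this problem.
What are the different ways to tackle this, and how can I extract strings from multi-line text without knowing in advance what the column names will be?
Solution
Basic working solution
Here is a working solution. We need to specify which two lines to "combine".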
import re

def find_group(l1, l2):
    def intersect(x1, x2):
        # True when the two (start, end) spans overlap or touch.
        return x1[0] <= x2[1] and x1[1] >= x2[0]

    pat = r"[a-zA-Z]+"
    # (start, end) coordinates of every word on each line
    matches1 = [(match.start(0), match.end(0)) for match in re.finditer(pat, l1)]
    matches2 = [(match.start(0), match.end(0)) for match in re.finditer(pat, l2)]
    ret = []
    for g2 in matches2:
        add_g2 = True
        for g1 in matches1:
            if intersect(g1, g2):
                # A word on l1 sits above this word on l2: join them.
                ret.append(l1[g1[0]:g1[1]] + " " + l2[g2[0]:g2[1]])
                add_g2 = False
                break
        if add_g2:
            # Nothing sits above it: keep the l2 word on its own.
            ret.append(l2[g2[0]:g2[1]])
    return ret
Here is how it handles your example:
find_group(column_lines[1], column_lines[2])  # The two lines to combine must be passed explicitly.
# > ['store', 'toys', 'total output', 'person', 'total usage', 'final stock']
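To make the overlap test more concrete, here is a small sketch that prints the spans `re.finditer` reports for two short aligned lines and shows which words end up paired. The two sample lines and the helper name `word_spans` are made up for illustration. Note that `match.end()` is exclusive, so with `<=` two words that merely touch also count as intersecting.

```python
import re

def word_spans(line, pat=r"[a-zA-Z]+"):
    """Return (start, end) for each word; end is exclusive, as in Match.end()."""
    return [(m.start(), m.end()) for m in re.finditer(pat, line)]

def intersect(x1, x2):
    """True when two (start, end) spans overlap or touch."""
    return x1[0] <= x2[1] and x1[1] >= x2[0]

# Two hypothetical header lines; "total" sits directly above "output".
top = "        total"
bottom = "toys    output"

top_spans = word_spans(top)
bottom_spans = word_spans(bottom)

for span in bottom_spans:
    # Words on the top line whose span intersects this bottom-line word
    above = [top[s:e] for s, e in top_spans if intersect((s, e), span)]
    print(bottom[span[0]:span[1]], "<-", above)
```

Only "output" gets a word from the line above; "toys" has nothing over it and stays on its own, which is exactly the case `add_g2` handles.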
General solution
Here is a solution that works for any number of lines.
import re

def find_group(lines):
    if isinstance(lines, str):
        lines = lines.split("\n")

    def intersect(x1, x2):
        """Checks if two couples of x-coordinates intersect."""
        return x1[0] <= x2[1] and x1[1] >= x2[0]

    pat = r"[a-zA-Z]+"
    # Coordinates of all parts matching the pattern, per line
    matches = [[(match.start(0), match.end(0)) for match in re.finditer(pat, line)]
               for line in lines]
    # Starts from the matches of line 0, stored as lists so that every group
    # has the same type (sorting a mix of tuples and lists raises TypeError)
    groups = [list(m) for m in matches[0]]
    for i in range(1, len(lines)):
        for g2 in matches[i]:
            add_g2 = True
            for i_g1, g1 in enumerate(groups):
                if intersect(g1, g2):
                    # Merge both lines intersection into the variable groups
                    groups[i_g1] = [min(g1[0], g2[0]), max(g1[1], g2[1])]
                    add_g2 = False
                    break
            if add_g2:
                # If alone in the x-coord, adds the match as a new group
                groups.append([g2[0], g2[1]])
        # "groups" becomes the merge of the first i lines results.
    # Sorts the groups by their first coordinate.
    # Then joins all matches located between each group's coordinates
    listed_groups = [[" ".join(re.findall(pat, line[group[0]: group[1]]))
                      for line in lines]
                     for group in sorted(groups)]
    # Replaces all unnecessary whitespaces and format groups as strings
    return [re.sub(r"\s+", " ", " ".join(g).strip()) for g in listed_groups]
Result:
column_names = """
 ================================================================================
 some       I                     each                     hopefully
 kind       love
 of
                       total                    total        final
 store.     toys       output     person 1/     usage 5/     stock
 ================================================================================"""
find_group(column_names)
# > ['some kind of store',
#    'I love toys',
#    'total output',
#    'each person',
#    'total usage',
#    'hopefully final stock']
Let me know if you need more explanation.