I have a list of lists that holds the start position of each column in every row of an OCR'd table.

[[16, 102, 119, 136],
 [16, 48, 76, 109, 145],
 [16, 47, 75, 108, 128, 145],
 [16, 48, 77, 110, 141],
 [98, 135]]
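For readers who want to reproduce lists like the one above from raw OCR text, here is a minimal sketch. It assumes that cells are separated by at least `min_gap` spaces while shorter runs of spaces stay inside a cell; `column_starts` and `min_gap` are hypothetical names for illustration, not part of my actual pipeline.

```python
import re

def column_starts(line, min_gap=3):
    """Return the start offset of every cell in one line of OCR text.

    Assumption: runs of fewer than `min_gap` spaces belong to a cell,
    longer runs separate two cells. `min_gap` must be at least 2.
    """
    cell = re.compile(r"\S+(?: {1,%d}\S+)*" % (min_gap - 1))
    return [match.start() for match in cell.finditer(line)]

# Built by concatenation so the offsets are easy to check by hand.
sample = " " * 4 + "CUENTA CORRIENTE" + " " * 4 + "0000  0000" + " " * 4 + "EUR"
print(column_starts(sample))             # [4, 24, 38]
print(column_starts(sample, min_gap=2))  # [4, 24, 30, 38]
```

With the default gap, the two-space breaks inside an account number do not open a new column.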

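My initial idea is to use the longest list as a reference and align the other lists to it by similarity. Conceptually this is like a fuzzy join, except that each value must get exactly one match (at least one and at most one).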


[[16, '', '', 102, 119, 136],
 [16, 48, 76, 109,  '', 145],
 [16, 47, 75, 108, 128, 145],
 [16, 48, 77, 110,  '', 141],
 ['', '', '',  98,  '', 135]]
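One way to state the "exactly one match" idea precisely is as an order-preserving assignment: every value in a row takes a distinct reference column, left to right, so that the total absolute distance is minimal. The sketch below does this with a small dynamic program; `align_row` is a hypothetical helper, and the cost function (plain absolute difference) is an assumption rather than a settled choice.

```python
def align_row(row, reference):
    """Align `row` to `reference`, keeping order, one value per column.

    Returns a list as long as `reference`; unused columns hold None.
    """
    n, m = len(row), len(reference)
    if n > m:
        raise ValueError("row has more values than the reference")
    inf = float("inf")
    # cost[i][j]: cheapest way to place row[i:] into reference[j:].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[n][j] = 0.0  # nothing left to place
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            use = abs(row[i] - reference[j]) + cost[i + 1][j + 1]
            skip = cost[i][j + 1]
            cost[i][j] = min(use, skip)
    # Walk the table again to recover the chosen columns.
    aligned = [None] * m
    i = j = 0
    while i < n:
        use = abs(row[i] - reference[j]) + cost[i + 1][j + 1]
        if use <= cost[i][j + 1]:
            aligned[j] = row[i]
            i += 1
        j += 1
    return aligned

reference = [16, 47, 75, 108, 128, 145]
print(align_row([16, 102, 119, 136], reference))  # [16, None, None, 102, 119, 136]
print(align_row([98, 135], reference))            # [None, None, None, 98, 135, None]
```

Note that with a pure distance criterion the 135 in the last row lands under 128 (distance 7) rather than under 145 (distance 10), so it does not reproduce the last column of the table above; getting that placement would need extra information, such as right-aligned end positions.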

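The overall goal is to load the string below into a DataFrame; I am including it in case someone can suggest a different approach. As you can see, it has missing headers and missing cells, which is why I came up with the idea above: once the common column positions are known, I can split each line at those positions and write the result to CSV.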

                Cuentas a  la  banca                                                                  INTERES          DIVISA           EUR 
                CUENTA CORRIENTE EMPRESAS      0000  0000  000000000000    EUR                              0,00 %                              0.00 
                CUTRECUENTA EMPRESAS           0000  0000  000000000000    USD                              0.00 %              00.00            00.00 
                CUENTA CORRIENTE EMPRESAS       0000  0000  000000000000     EUR                              0.00%                          00 000.00 
                                                                                                  TOTAL                                00 000,00 
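To make the "split at common positions" step concrete, here is a minimal sketch that cuts each line at a shared list of start offsets and writes the cells as CSV. `split_at` and `to_csv` are hypothetical helpers, and the offsets are assumed to be already agreed on for all lines.

```python
import csv
import io

def split_at(line, starts):
    """Cut `line` at the given start offsets and strip each piece.

    Text before the first offset is dropped; the last cell runs to the
    end of the line.
    """
    bounds = list(starts) + [None]
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]

def to_csv(lines, starts):
    """Return the lines as CSV text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        writer.writerow(split_at(line, starts))
    return buffer.getvalue()

lines = [
    "ab" + " " * 3 + "cd" + " " * 3 + "ef",
    "gh" + " " * 8 + "ij",  # the middle cell is missing
]
print(to_csv(lines, [0, 5, 10]))
# ab,cd,ef
# gh,,ij
```

If pandas is available, `pandas.read_fwf` with an explicit `colspecs` argument might do the same job directly, provided the column boundaries are known.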

Tags: python, join, ocr, nested-lists, tabular



import numpy as np

beginings = [[16, 102, 119, 136], [16, 48, 76, 109, 145], [16, 47, 75, 108, 128, 145], [16, 48, 77, 110, 141], [98, 135]]
# beginings = [[16, 17, 18, 136], [16, 17, 18, 109, 145], [16, 47, 75, 108, 128, 145], [16, 48, 77, 110, 141], [98, 135]] # use this to reproduce a possible issue when the nearest position is already filled
num_col = max(len(row) for row in beginings)

# Use the longest row as the reference; every other row is matched to it by similarity.
index_longest_list = max(enumerate(beginings), key=lambda tup: len(tup[1]))[0]

def distance(x, y):
    return abs(x - y)

aligned_list = np.full((len(beginings), num_col), np.nan)
reference = beginings[index_longest_list]

for row_pos, line in enumerate(beginings):
    for start in line:
        # Distance from this start position to every reference column.
        distances = [distance(start, ref_start) for ref_start in reference]
        index = np.argmin(distances)
        # If the nearest column is already taken in this row, step right when
        # `start` is larger than the value stored there, otherwise step left,
        # until a free cell is found.
        while not np.isnan(aligned_list[row_pos, index]):
            previous_value = aligned_list[row_pos, index]
            if start > previous_value:
                index += 1
            else:
                index -= 1
        aligned_list[row_pos, index] = start
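
The array built above holds floats, with NaN in the unused cells. To compare it with the desired output, something like the following sketch could convert it back to integers and empty strings. `blank_missing` is a hypothetical helper; it expects plain nested lists, for example the result of `aligned_list.tolist()`.

```python
import math

def blank_missing(rows):
    """Replace NaN placeholders with '' and cast the other values to int."""
    return [["" if math.isnan(value) else int(value) for value in row]
            for row in rows]

nan = float("nan")
print(blank_missing([[16.0, nan, nan, 102.0, 119.0, 136.0]]))
# [[16, '', '', 102, 119, 136]]
```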

