Parsing/extracting a table from a messy .csv file?

Problem description

I am using Amazon Textract to parse images (png) and extract tables. When I open the resulting file with open(file_name, "r") and read its lines, this is an example of such a csv:

['Table: Table_1\n',
 '\n',
 'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
 'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
 'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
 'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
 'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
 'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
 'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
 'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
 'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
 'AST ,27 ,,10-35 U/L ,EN ,\n',
 'ALT ,19 ,,9-46 U/L ,EN ,\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']

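I can read it with pandas read_csv, but I run into errors because the file arrives in a different format every time: more or fewer spaces, and different leading lines before the header. How can I extract the table from a csv like this?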

Tags: python-3.x, pandas, amazon-textract

Solution


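I suggest curating your data first and then loading the curated rows into pandas as a list. The problem I see in your example is that the first field itself contains commas, which interferes with CSV parsing because the comma is also the field delimiter. The data therefore needs to be cleaned up before it is split. Please find my Python 3 source code below: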

data = ['Table: Table_1\n',
        '\n',
        'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
        'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
        'ALBUMIN ,4.3 ,,3.6-5.1 g/dL ,EN ,\n',
        'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
        'ALBUMIN/GLOBULIN RATIO ,2.4 ,,1.0-2.5 (calc) ,EN ,\n',
        'BILIRUBIN, TOTAL ,0.6 ,,0.2-1.2 mg/dL ,EN ,\n',
        'BILIRUBIN, DIRECT ,0.2 ,,< OR = 0.2 mg/dL ,EN ,\n',
        'BILIRUBIN, INDIRECT ,0.4 ,,0.2-1.2 mg/dL (calc) ,EN ,\n',
        'ALKALINE PHOSPHATASE ,61 ,,40-115 U/L ,EN ,\n',
        'AST ,27 ,,10-35 U/L ,EN ,\n',
        'ALT ,19 ,,9-46 U/L ,EN ,\n',
        '\n',
        '\n',
        '\n',
        '\n',
        '\n']



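import re

import pandas as pd

# Strip the trailing newline from every raw line.
lines = [x.replace('\n', '') for x in data]

# Match the leading test-name field: upper-case words, optionally containing
# commas, up to the comma that ends the field. The 'Table: ...' line and the
# header contain lower-case letters, so they do not match and are skipped.
p = re.compile(r'^[/A-Z ]+,*[/A-Z ]*,')
curated_lines = []
for line in lines:
    m = p.search(line)
    if m is not None:
        s = m.group(0)
        cs = s.replace(',', '')  # test name with its embedded commas removed
        curated_lines.append(line.replace(s, cs + ','))

# Split on commas; the last element is the empty string left by the trailing comma.
frame_list_of_list = [line.split(',')[:-1] for line in curated_lines]

df = pd.DataFrame(frame_list_of_list,
                  columns=['Test Name', 'Result', 'Flag', 'Reference Range', 'Lab'])
print(df)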

This produces the following result:

                           Test Name Result  Flag        Reference Range  Lab
0  HEPATIC FUNCTION PANEL PROTEIN TOTAL    6.1                 6.1-8.1 g/dL   EN 
1                               ALBUMIN    4.3                 3.6-5.1 g/dL   EN 
2                              GLOBULIN    1.8   LOW    1.9-3.7 g/dL (calc)   EN 
3                ALBUMIN/GLOBULIN RATIO    2.4               1.0-2.5 (calc)   EN 
4                       BILIRUBIN TOTAL    0.6                0.2-1.2 mg/dL   EN 
5                      BILIRUBIN DIRECT    0.2             < OR = 0.2 mg/dL   EN 
6                    BILIRUBIN INDIRECT    0.4         0.2-1.2 mg/dL (calc)   EN 
7                  ALKALINE PHOSPHATASE     61                   40-115 U/L   EN 
8                                   AST     27                    10-35 U/L   EN 
9                                   ALT     19                     9-46 U/L   EN 
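
Since the question mentions that the lines before the header and the amount of whitespace vary between files, here is a minimal alternative sketch that does not depend on upper-case test names. It rests on assumptions taken from the sample only: the table always has five columns, every table row ends with a trailing comma, and extra commas appear only in the first field. The helper name parse_textract_lines and the constant N_COLUMNS are hypothetical, chosen for illustration. The idea is to skip any line with too few commas and to split each remaining line from the right, so that surplus commas stay inside the test name.

```python
import pandas as pd

N_COLUMNS = 5  # assumption: five columns, as in the sample output


def parse_textract_lines(lines, n_columns=N_COLUMNS):
    """Hypothetical helper: build a DataFrame from raw Textract CSV lines."""
    rows = []
    for line in lines:
        text = line.strip()
        # Blank lines and preamble such as 'Table: Table_1' have fewer
        # commas than a table row (one per column, including the trailing one).
        if text.count(',') < n_columns:
            continue
        # Drop the single trailing comma (assumed present on every row).
        if text.endswith(','):
            text = text[:-1]
        # Split from the right: extra commas remain in the first field.
        fields = text.rsplit(',', n_columns - 1)
        rows.append([field.strip() for field in fields])
    if not rows:
        return pd.DataFrame()
    # The first qualifying line is treated as the header row.
    header, *body = rows
    return pd.DataFrame(body, columns=header)


sample = [
    'Table: Table_1\n',
    '\n',
    'Test Name ,Result ,Flag ,Reference Range ,Lab ,\n',
    'HEPATIC FUNCTION PANEL PROTEIN, TOTAL ,6.1 ,,6.1-8.1 g/dL ,EN ,\n',
    'GLOBULIN ,1.8 ,LOW ,1.9-3.7 g/dL (calc) ,EN ,\n',
    '\n',
]
df = parse_textract_lines(sample)
print(df)
```

Unlike the regex approach, this keeps the comma inside names such as "BILIRUBIN, TOTAL" and strips the padding spaces from every cell. It would misplace values if a column other than the first ever contained a comma, so check that assumption against your own files.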
