Odd .txt report into a Pandas DataFrame

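Problem description

I have a .txt report that contains account numbers, addresses, and credit limits laid out as a printed report.

It has page breaks, but it generally looks like this: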

Customer   Address            Credit limit
A001       Wendy's            20000
           123 Main Street
           City, State Zip

I want my DataFrame to look like this:

Customer   Address                                Credit Limit
A001       Wendy's 123 Main Street, City, State   20000

Here is a link to the sample csv I am working with.

http://faculty.tlu.edu/mthompson/IDEA%20files/Customer.txt

I tried skipping rows, but that did not work.

Tags: python, pandas, csv, report, analysis

Solution


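OK, there is nothing hard about this format, but it is not csv. So neither the Python csv module nor pandas read_csv can be used. We will have to parse it by hand.

The trickiest decision is how to identify the first and last line of each customer. I would use:

  • the first line is longer than 100 characters, starts with a word containing only uppercase letters and digits, and ends with a word containing only digits
  • the block ends at the first empty line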
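The two rules above can be sketched as a pair of small helpers. This is only an illustration under stated assumptions, not the exact code of the answer: the names `is_header_line` and `is_block_end`, the two regular expressions, and the sample line are all made up for the example.

```python
import re

# Assumed patterns: an account code is uppercase letters followed by digits,
# and the credit limit is a purely numeric word at the end of the line.
HEADER_START = re.compile(r'\s*[A-Z]+\d+\s')
HEADER_END = re.compile(r'\s\d+\s*$')

def is_header_line(line, min_len=100):
    """Return True if `line` looks like the first line of a customer block."""
    return (len(line) > min_len
            and HEADER_START.match(line) is not None
            and HEADER_END.search(line) is not None)

def is_block_end(line):
    """A block ends at the first empty (whitespace-only) line."""
    return line.strip() == ''

# Hypothetical header line, padded so it is longer than 100 characters.
sample = (' ' * 5 + 'A001'.ljust(18) + 'Dan Ackroyd'.ljust(34)
          + 'Audenshaw'.ljust(33) + '20000'.rjust(15) + '\n')

print(is_header_line(sample))                # a well-formed first line
print(is_header_line('   125 New Street'))   # an address continuation line
print(is_block_end('   \n'))                 # a blank separator line
```

The length check is what filters out short lines such as page headers and address continuations before the regular expressions are even tried.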

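Once that is done:

  • the first line contains the account number, the name, the first address line, and the credit limit
  • the following lines contain additional lines of the address
  • the fields sit at fixed positions: [5,19), [23,49), [57,77), [90,end_of_line)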
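Cutting a line at fixed positions is plain string slicing. The sketch below assumes the column ranges listed above; the helper name `split_fields` and the sample line are hypothetical. It uses `None` as the open end of the last field, so that a final line without a trailing newline does not lose its last character.

```python
# Half-open [start, end) column ranges; None means "to the end of the line".
FIELD_POS = [(5, 19), (23, 49), (57, 77), (90, None)]

def split_fields(line):
    """Cut a fixed-width line into stripped field values."""
    return [line[start:end].strip() for start, end in FIELD_POS]

# Hypothetical header line laid out on the column grid described above.
sample = (' ' * 5 + 'A001'.ljust(18) + 'Dan Ackroyd'.ljust(34)
          + 'Audenshaw'.ljust(33) + '20000'.rjust(15) + '\n')

fields = split_fields(sample)
print(fields)  # ['A001', 'Dan Ackroyd', 'Audenshaw', '20000']
```

Each value is stripped after slicing, so the padding between columns and the trailing newline never end up in the data.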

In Python, this gives:

import re
import pandas as pd

file = 'Customer.txt'                            # assumed local path of the downloaded report

# position of fields in the initial line; None means "up to the end of the line"
fieldpos = [(5,19), (23,49), (57,77), (90, None)]

inblock = False                                  # we do not start inside a block

account_pat = re.compile(r'[A-Z]+\d+\s*$')       # regex patterns are compiled once for performance
limit_pat = re.compile(r'\s*\d+$')

data = []                                        # a list for the accounts

with open(file) as fd:
    for line in fd:
        if not inblock:
            if len(line) > 100:                  # only long lines can start a block
                row = [line[f[0]:f[1]].strip() for f in fieldpos]
                if account_pat.match(row[0]) and limit_pat.match(row[-1]):
                    inblock = True
                    data.append(row)
        else:
            line = line.strip()
            if len(line) > 0:
                row[2] += ', ' + line            # continuation line: extend the address
            else:
                inblock = False                  # an empty line closes the block

# we can now build a dataframe
df = pd.DataFrame(data, columns=['Account Number', 'Name', 'Address', 'Credit Limit'])

It finally gives:

   Account Number                 Name                                            Address Credit Limit
0            A001          Dan Ackroyd  Audenshaw, 125 New Street, Montreal, Quebec, H...        20000
1            A123           Mike Atsil  The Vetinary House, 123 Dog Row, Thunder Bay, ...        20000
2            A128            Ivan Aker            The Old House, Ottawa, Ontario, P1D 8D4        10000
3            B001         Kim Basinger    Mesh House, Fish Street, Rouyn, Quebec, J5V 2A9        12000
4            B002       Richard Burton  Eagle Castle, Leafy Lane, Sudbury, Ontario, L3...         9000
5            B004         Jeff Bridges  Arrow Road North, Lakeside, Kenora, Ontario, N...        20000
6            B008          Denise Bent  The Dance Studio, Covent Garden, Montreal, Que...        20000
7            B010          Carter Bout  Removals Close, No Fixed Abode Road, Toronto, ...        20000
8            B022         Ronnie Biggs     Gotaway Cottage, Thunder Bay, Ontario, K3A 6F3         5000
9            C001           Tom Cruise  The Firm, Gunnersbury, Waskaganish, Quebec, G1...        25000
10           C003           John Candy  The Sweet Shop, High Street, Trois Rivieres, Q...        15000
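Because every field is sliced out of a text line, the `Credit Limit` column holds strings and has to be converted before it can be summed or compared. A minimal sketch, assuming a DataFrame shaped like the one above (the three rows here are copied from that output, with the address column left out for brevity):

```python
import pandas as pd

# Small stand-in for the parsed result; all values are strings, as the parser produces them.
df = pd.DataFrame(
    [['A001', 'Dan Ackroyd', '20000'],
     ['A128', 'Ivan Aker', '10000'],
     ['B022', 'Ronnie Biggs', '5000']],
    columns=['Account Number', 'Name', 'Credit Limit'])

# Convert the text column to a numeric dtype so arithmetic works.
df['Credit Limit'] = pd.to_numeric(df['Credit Limit'])

total = df['Credit Limit'].sum()
print(total)  # 35000
```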
