首页 > 解决方案 > Python 从大型文本文件中读取完整行的块(列值拆分为多行)

问题描述

我想分块读取一个大的 .txt 文件(c.2.5GB),然后在加载到数据库之前执行一些操作。

该文件只有 2 列(列分隔符为¬),并以double quotes. 第二列中的值可能跨越多行(示例如下)。我想过使用这个答案,但问题是它可能会处理不完整的行,因为它取决于预设chunk size。有人可以帮忙吗?我在下面包含了示例数据和代码。

样本数据 (Sample_load_file.txt)

"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"

代码

import pandas as pd
import os
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
db = DBConnection(server ='XXXX', database='XXXX')

def read_in_chunks(file_object, chunk_size=1024):
    #Lazy load to read a file piece by piece (avoiding moemory issues)
    #Default chunk size: 1k.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
        
def process_chunk(data=piece):
    #Build a list of lines based on ' "\n" ' as custom separator
    data = data.split('"\n"')
    
    #Split each line based on ' "¬" ' as custom separator
    data = [line.split('"¬"') for line in data]
    
    #Cleanup remaining double quotes
    data = [[e.replace('"', '') for e in line] for line in data]
    
    #Check the number of columns
    number_of_cols = len(str(data[0]).split('¬'))
    number_of_cols
    
    #Load data into a dataframe
    df = pd.DataFrame(data)
    
    #Reformat dataframe
    df.columns = df.iloc[0] # Set first row as column index
    df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
    
    #Split the first column into two
    try:
        df[['LINE_ID', 'LINE_TEXT']] = df['LINE_ID¬LINE_TEXT'].str.split('¬',expand=True)
    except:
        print('Error')
    del df['LINE_ID¬LINE_TEXT']
    
    #Add metadata
    df['loaded_by'] = 'XXXX'
    df['file_line_number'] = range(2,len(df)+2)
    df['load_date'] = pd.datetime.now()
    df['source_file'] = path
    df['loading_script'] = r'Load_Extracts.ipynb'    
    
    #Load in SQL db
    df.to_sql('SQL_table_name', db.engine, schema='dbo', index=False, if_exists='append')
    
#Load text file
with open(path) as f:
    for piece in read_in_chunks(f):
        process_data(piece)

标签: pythontext-files

解决方案


如果LINE_ID适合一行,您可以尝试使用利用多行记录的第一行包含的生成器"¬"

def make_records(file):
    current = []
    for line in file:
        line = line.rstrip()
        if '"¬"' in line:
            if current:
                yield " ".join(current)
            current = [line]
        else:
            current.append(line)
    yield " ".join(current)

使用示例输入:

>>> import io
>>> 
>>> s = '''"LINE_ID"¬"LINE_TEXT"
... "C1111-G00-BC222"¬"this line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
... "
... "C22-f0-333"¬"2nd row. This line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
...   *******************************************************************
...   This line also includes the column delimiter within text qualifier
...   *******************************************************************
...   # !¬!¬!¬|
... "'''
>>> f = io.StringIO(s)
>>> for record in make_records(f):
...    print(record)
... 
"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening "
"C22-f0-333"¬"2nd row. This line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening   *******************************************************************   This line also includes the column delimiter within text qualifier   *******************************************************************   # !¬!¬!¬| "

注意:您可能想要更改生成器yield的内容,例如,list或者tuple代替str,删除双引号,根据您的需要跳过第一行。我io.StringIO仅用于说明目的,您将从“正常”文件中读取。


推荐阅读