python - Python 从大型文本文件中读取完整行的块(列值拆分为多行)
问题描述
我想分块读取一个大的 .txt 文件(c.2.5GB),然后在加载到数据库之前执行一些操作。
该文件只有 2 列(列分隔符为¬
),并以double quotes
. 第二列中的值可能跨越多行(示例如下)。我想过使用这个答案,但问题是它可能会处理不完整的行,因为它取决于预设chunk size
。有人可以帮忙吗?我在下面包含了示例数据和代码。
样本数据 (Sample_load_file.txt)
"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
*******************************************************************
This line also includes the column delimiter within text qualifier
*******************************************************************
# !¬!¬!¬|
"
代码
import pandas as pd
import os
from dbconnection import DBConnection
path = r'C:\Sample_load_file.txt'
db = DBConnection(server ='XXXX', database='XXXX')
def read_in_chunks(file_object, chunk_size=1024):
#Lazy load to read a file piece by piece (avoiding moemory issues)
#Default chunk size: 1k.
while True:
data = file_object.read(chunk_size)
if not data:
break
yield data
def process_chunk(data=piece):
#Build a list of lines based on ' "\n" ' as custom separator
data = data.split('"\n"')
#Split each line based on ' "¬" ' as custom separator
data = [line.split('"¬"') for line in data]
#Cleanup remaining double quotes
data = [[e.replace('"', '') for e in line] for line in data]
#Check the number of columns
number_of_cols = len(str(data[0]).split('¬'))
number_of_cols
#Load data into a dataframe
df = pd.DataFrame(data)
#Reformat dataframe
df.columns = df.iloc[0] # Set first row as column index
df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
#Split the first column into two
try:
df[['LINE_ID', 'LINE_TEXT']] = df['LINE_ID¬LINE_TEXT'].str.split('¬',expand=True)
except:
print('Error')
del df['LINE_ID¬LINE_TEXT']
#Add metadata
df['loaded_by'] = 'XXXX'
df['file_line_number'] = range(2,len(df)+2)
df['load_date'] = pd.datetime.now()
df['source_file'] = path
df['loading_script'] = r'Load_Extracts.ipynb'
#Load in SQL db
df.to_sql('SQL_table_name', db.engine, schema='dbo', index=False, if_exists='append')
#Load text file
with open(path) as f:
for piece in read_in_chunks(f):
process_data(piece)
解决方案
如果LINE_ID
适合一行,您可以尝试使用利用多行记录的第一行包含的生成器"¬"
:
def make_records(file):
current = []
for line in file:
line = line.rstrip()
if '"¬"' in line:
if current:
yield " ".join(current)
current = [line]
else:
current.append(line)
yield " ".join(current)
使用示例输入:
>>> import io
>>>
>>> s = '''"LINE_ID"¬"LINE_TEXT"
... "C1111-G00-BC222"¬"this line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
... "
... "C22-f0-333"¬"2nd row. This line is
... split into
... multiple lines
... % All needs to be read into 1 line
... % Currently that's not happening
... *******************************************************************
... This line also includes the column delimiter within text qualifier
... *******************************************************************
... # !¬!¬!¬|
... "'''
>>> f = io.StringIO(s)
>>> for record in make_records(f):
... print(record)
...
"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening "
"C22-f0-333"¬"2nd row. This line is split into multiple lines % All needs to be read into 1 line % Currently that's not happening ******************************************************************* This line also includes the column delimiter within text qualifier ******************************************************************* # !¬!¬!¬| "
注意:您可能想要更改生成器yield
的内容,例如,list
或者tuple
代替str
,删除双引号,根据您的需要跳过第一行。我io.StringIO
仅用于说明目的,您将从“正常”文件中读取。
推荐阅读
- android - 卸载应用程序是否会删除应用程序使用的设备所有者设置?
- python - 如何重新启动kivy?
- c# - 解析日志文件,不明确的分隔符
- python - 使用 BeautifulSoup 解析却一无所获
- sql-server - 如何将文件和其他信息发布到控制器并将其保存在 sql server 中?
- python - 从一个整数列表到另一个整数列表的最小转换次数
- postgresql - 如何在 PostgreSQL 中锁定 SELECT?
- r - 当情节中有一些阿拉伯或波斯字母时,如何使用逆戟鲸保存情节?
- azure - 我的查询是否有任何问题以获取我的令牌?
- cluster-analysis - 查找 DBLP 数据集中的簇数