首页 > 解决方案 > Linux 中的大型 CSV/文本处理

问题描述

我有一组相当大的 CSV 文件(每条记录约 1M 条,15MB+),我需要提取、重新格式化然后附加到模板文本文件中的内容。

我编写了以下 python 脚本来执行此操作,它可以执行我想要的操作,但运行速度非常慢(单个文件的处理时间约为 15 分钟),并且我有数千个 CSV 文件要处理。

我写的脚本如下:

import pandas as pd
import shutil as s
from datetime import datetime as dt


dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'


df = pd.read_csv(dir + f)
df["DataDate"] = pd.to_datetime(df["DataDate"]).view('int64') 
df["Expiration"] = pd.to_datetime(df["Expiration"], format='%m/%d/%Y')

df["measurement"] = ['options' for t in range(len(df))]
lines = [str(df["measurement"][d])
        + ",type=options"
        + " " 
        + "ticker=" + str(df["UnderlyingSymbol"][d])
        + ",symbol=" + str(df["OptionSymbol"][d])
        + ",expDate=" + str(df["Expiration"][d])
        + ",cont=" + str(df["Type"][d])
        + ",price=" + str(df["UnderlyingPrice"][d])
        + ",strike=" + str(df["Strike"][d])
        + ",last=" + str(df["Last"][d])
        + ",bid=" + str(df["Bid"][d])
        + ",ask=" + str(df["Ask"][d])
        + ",volume=" + str(df["Volume"][d])
        + ",OI=" + str(df["OpenInterest"][d])
        + ",IV=" + str(df["IV"][d])
        + ",delta=" + str(df["Delta"][d])
        + ",gamma=" + str(df["Gamma"][d])
        + ",theta=" + str(df["Theta"][d])
        + ",vega=" + str(df["Vega"][d])
        + ",aka=" + str(df["AKA"][d])
        + " " + str(df["DataDate"][d]) for d in range(len(df))]


a = s.copyfile(dir + outFile, dir + 'import.txt')               
file = open(a, 'a+')
for item in lines:
        file.write("%s\n" % item)
file.close()

importbase.txt文件是模板文件。我需要将我的数据附加到其内容中,然后将其保存以import.txt进行处理。importbase、import 和我的 CSV 示例如下:

进口基地.txt

# DDL
CREATE DATABASE import

# DML
# CONTEXT-DATABASE: import

导入.txt

# DDL
CREATE DATABASE import

# DML
# CONTEXT-DATABASE: import
options,type=options ticker=A,symbol=A020420C00015000,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=15.0,last=0.0,bid=21.0,ask=22.3,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0.0,theta=0.6848,vega=0.0aka=ADC 1017619200000000000
options,type=options ticker=A,symbol=A020420P00015000,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=15.0,last=0.0,bid=0.0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0.0,gamma=0.0,theta=0.0,vega=0.0aka=APC 1017619200000000000
options,type=options ticker=A,symbol=A020420C00017500,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=17.5,last=0.0,bid=18.7,ask=19.5,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0.0,theta=0.6371,vega=0.0aka=ADW 1017619200000000000
options,type=options ticker=A,symbol=A020420P00017500,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=17.5,last=0.0,bid=0.0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0.0,gamma=0.0,theta=0.0,vega=0.0aka=APW 1017619200000000000

20020401.csv

UnderlyingSymbol,UnderlyingPrice,Exchange,OptionSymbol,OptionExt,Type,Expiration,DataDate,Strike,Last,Bid,Ask,Volume,OpenInterest,IV,Delta,Gamma,Theta,Vega,AKA
A,36.53,*,A020420C00015000,,call,04/20/2002,04/01/2002,15,0,21,22.3,0,0,0.6145,0.9987,0,0.6848,0,ADC
A,36.53,*,A020420P00015000,,put,04/20/2002,04/01/2002,15,0,0,0.1,0,0,0.5964,0,0,0,0,APC
A,36.53,*,A020420C00017500,,call,04/20/2002,04/01/2002,17.5,0,18.7,19.5,0,0,0.6145,0.9987,0,0.6371,0,ADW
A,36.53,*,A020420P00017500,,put,04/20/2002,04/01/2002,17.5,0,0,0.1,0,0,0.5964,0,0,0,0,APW
A,36.53,*,A020420C00020000,,call,04/20/2002,04/01/2002,20,0,16.2,17,0,0,0.6145,0.9987,0,0.5882,0.0002,ADD
A,36.53,*,A020420P00020000,,put,04/20/2002,04/01/2002,20,0,0,0.1,0,0,0.5964,0,0,-0.0006,0.0001,APD
A,36.53,*,A020420C00022500,,call,04/20/2002,04/01/2002,22.5,5.7,13.7,14.5,0,5,0.6145,0.9985,0.0001,0.5092,0.0053,ADX
A,36.53,*,A020420P00022500,,put,04/20/2002,04/01/2002,22.5,0,0,0.1,0,1545,0.5964,-0.0001,0.0001,-0.0214,0.0036,APX
A,36.53,*,A020420C00025000,,call,04/20/2002,04/01/2002,25,11.6,11.2,12,0,545,0.6145,0.9963,0.0015,0.1189,0.0612,ADE
A,36.53,*,A020420P00025000,,put,04/20/2002,04/01/2002,25,0,0,0.1,0,882,0.5964,-0.0019,0.0012,-0.2882,0.0485,APE
A,36.53,*,A020420C00030000,,call,04/20/2002,04/01/2002,30,6.5,6.4,6.9,80,2357,0.6145,0.9314,0.0259,-6.1328,1.0651,ADF

我使用了相当多的 bash,但想知道 sed 或 awk 之类的东西是否更适合像这样的较低级别的任务,因为处理时间可能是一个问题。任何有关方法的指导表示赞赏。

标签: pythonbashcsvawksed

解决方案


我犹豫将其发布为答案,因为它只是猜测。但是您可能会通过迭代数据框行并格式化单个字符串而不是连接一堆字符串来节省时间。

import pandas as pd
import shutil as s
from datetime import datetime as dt

dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'


df = pd.read_csv(dir + f)
df["DataDate"] = pd.to_datetime(df["DataDate"]).view('int64') 
df["Expiration"] = pd.to_datetime(df["Expiration"], format='%m/%d/%Y')

df["measurement"] = ['options' for t in range(len(df))]

a = s.copyfile(dir + outFile, dir + 'import.txt')               
with open(a, 'a+') as file:
    for index, row in df.iterrows():
        file.write(
            "{measurement},type=options "
            "ticker={UnderlyingSymbol}"
            ",symbol={OptionSymbol}"
            ",expDate={Expiration}"
            ",cont={Type}"
            ",price={UnderlyingPrice}"
            ",strike={Strike}"
            ",last={Last}"
            ",bid={Bid}"
            ",ask={Ask}"
            ",volume={Volume}"
            ",OI={OpenInterest}"
            ",IV={IV}"
            ",delta={Delta}"
            ",gamma={Gamma}"
            ",theta={Theta}"
            ",vega={Vega}"
            ",aka={AKA}"
            " {DataDate}".format_map(row))

事实上,熊猫在这里并没有真正的优势。构建数据框占用了大量空间,但无论如何都需要通过 python 变量进行格式化,所以只需使用常规 CSV 模块

import csv
import shutil as s
from datetime import datetime as dt

dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'

a = s.copyfile(dir + outFile, dir + 'import.txt')               

with open(dir + f) as in_fp, open(a, "a") as out_fp:
    reader = csv.DictReader(in_fp)
    for row in reader:
        # todo: figure out conversions outside pandas
        #row["DataDate"] = ?
        #row["Expiration"] = ? 
        row["measurement"] = "options"
        file.write(
            "{measurement},type=options "
            "ticker={UnderlyingSymbol}"
            ",symbol={OptionSymbol}"
            ",expDate={Expiration}"
            ",cont={Type}"
            ",price={UnderlyingPrice}"
            ",strike={Strike}"
            ",last={Last}"
            ",bid={Bid}"
            ",ask={Ask}"
            ",volume={Volume}"
            ",OI={OpenInterest}"
            ",IV={IV}"
            ",delta={Delta}"
            ",gamma={Gamma}"
            ",theta={Theta}"
            ",vega={Vega}"
            ",aka={AKA}"
            " {DataDate}".format_map(row))

推荐阅读