python - Linux 中的大型 CSV/文本处理
问题描述
我有一组相当大的 CSV 文件(每条记录约 1M 条,15MB+),我需要提取、重新格式化然后附加到模板文本文件中的内容。
我编写了以下 python 脚本来执行此操作,它可以执行我想要的操作,但运行速度非常慢(单个文件的处理时间约为 15 分钟),并且我有数千个 CSV 文件要处理。
我写的脚本如下:
import pandas as pd
import shutil as s
from datetime import datetime as dt
dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'
df = pd.read_csv(dir + f)
df["DataDate"] = pd.to_datetime(df["DataDate"]).view('int64')
df["Expiration"] = pd.to_datetime(df["Expiration"], format='%m/%d/%Y')
df["measurement"] = ['options' for t in range(len(df))]
lines = [str(df["measurement"][d])
+ ",type=options"
+ " "
+ "ticker=" + str(df["UnderlyingSymbol"][d])
+ ",symbol=" + str(df["OptionSymbol"][d])
+ ",expDate=" + str(df["Expiration"][d])
+ ",cont=" + str(df["Type"][d])
+ ",price=" + str(df["UnderlyingPrice"][d])
+ ",strike=" + str(df["Strike"][d])
+ ",last=" + str(df["Last"][d])
+ ",bid=" + str(df["Bid"][d])
+ ",ask=" + str(df["Ask"][d])
+ ",volume=" + str(df["Volume"][d])
+ ",OI=" + str(df["OpenInterest"][d])
+ ",IV=" + str(df["IV"][d])
+ ",delta=" + str(df["Delta"][d])
+ ",gamma=" + str(df["Gamma"][d])
+ ",theta=" + str(df["Theta"][d])
+ ",vega=" + str(df["Vega"][d])
+ ",aka=" + str(df["AKA"][d])
+ " " + str(df["DataDate"][d]) for d in range(len(df))]
a = s.copyfile(dir + outFile, dir + 'import.txt')
file = open(a, 'a+')
for item in lines:
file.write("%s\n" % item)
file.close()
该importbase.txt
文件是模板文件。我需要将我的数据附加到其内容中,然后将其保存以import.txt
进行处理。importbase、import 和我的 CSV 示例如下:
进口基地.txt
# DDL
CREATE DATABASE import
# DML
# CONTEXT-DATABASE: import
导入.txt
# DDL
CREATE DATABASE import
# DML
# CONTEXT-DATABASE: import
options,type=options ticker=A,symbol=A020420C00015000,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=15.0,last=0.0,bid=21.0,ask=22.3,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0.0,theta=0.6848,vega=0.0aka=ADC 1017619200000000000
options,type=options ticker=A,symbol=A020420P00015000,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=15.0,last=0.0,bid=0.0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0.0,gamma=0.0,theta=0.0,vega=0.0aka=APC 1017619200000000000
options,type=options ticker=A,symbol=A020420C00017500,expDate=2002-04-20 00:00:00,cont=call,price=36.53,strike=17.5,last=0.0,bid=18.7,ask=19.5,volume=0,OI=0,IV=0.6145,delta=0.9987,gamma=0.0,theta=0.6371,vega=0.0aka=ADW 1017619200000000000
options,type=options ticker=A,symbol=A020420P00017500,expDate=2002-04-20 00:00:00,cont=put,price=36.53,strike=17.5,last=0.0,bid=0.0,ask=0.1,volume=0,OI=0,IV=0.5964,delta=0.0,gamma=0.0,theta=0.0,vega=0.0aka=APW 1017619200000000000
20020401.csv
UnderlyingSymbol,UnderlyingPrice,Exchange,OptionSymbol,OptionExt,Type,Expiration,DataDate,Strike,Last,Bid,Ask,Volume,OpenInterest,IV,Delta,Gamma,Theta,Vega,AKA
A,36.53,*,A020420C00015000,,call,04/20/2002,04/01/2002,15,0,21,22.3,0,0,0.6145,0.9987,0,0.6848,0,ADC
A,36.53,*,A020420P00015000,,put,04/20/2002,04/01/2002,15,0,0,0.1,0,0,0.5964,0,0,0,0,APC
A,36.53,*,A020420C00017500,,call,04/20/2002,04/01/2002,17.5,0,18.7,19.5,0,0,0.6145,0.9987,0,0.6371,0,ADW
A,36.53,*,A020420P00017500,,put,04/20/2002,04/01/2002,17.5,0,0,0.1,0,0,0.5964,0,0,0,0,APW
A,36.53,*,A020420C00020000,,call,04/20/2002,04/01/2002,20,0,16.2,17,0,0,0.6145,0.9987,0,0.5882,0.0002,ADD
A,36.53,*,A020420P00020000,,put,04/20/2002,04/01/2002,20,0,0,0.1,0,0,0.5964,0,0,-0.0006,0.0001,APD
A,36.53,*,A020420C00022500,,call,04/20/2002,04/01/2002,22.5,5.7,13.7,14.5,0,5,0.6145,0.9985,0.0001,0.5092,0.0053,ADX
A,36.53,*,A020420P00022500,,put,04/20/2002,04/01/2002,22.5,0,0,0.1,0,1545,0.5964,-0.0001,0.0001,-0.0214,0.0036,APX
A,36.53,*,A020420C00025000,,call,04/20/2002,04/01/2002,25,11.6,11.2,12,0,545,0.6145,0.9963,0.0015,0.1189,0.0612,ADE
A,36.53,*,A020420P00025000,,put,04/20/2002,04/01/2002,25,0,0,0.1,0,882,0.5964,-0.0019,0.0012,-0.2882,0.0485,APE
A,36.53,*,A020420C00030000,,call,04/20/2002,04/01/2002,30,6.5,6.4,6.9,80,2357,0.6145,0.9314,0.0259,-6.1328,1.0651,ADF
我使用了相当多的 bash,但想知道 sed 或 awk 之类的东西是否更适合像这样的较低级别的任务,因为处理时间可能是一个问题。任何有关方法的指导表示赞赏。
解决方案
我犹豫将其发布为答案,因为它只是猜测。但是您可能会通过迭代数据框行并格式化单个字符串而不是连接一堆字符串来节省时间。
import pandas as pd
import shutil as s
from datetime import datetime as dt
dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'
df = pd.read_csv(dir + f)
df["DataDate"] = pd.to_datetime(df["DataDate"]).view('int64')
df["Expiration"] = pd.to_datetime(df["Expiration"], format='%m/%d/%Y')
df["measurement"] = ['options' for t in range(len(df))]
a = s.copyfile(dir + outFile, dir + 'import.txt')
with open(a, 'a+') as file:
for index, row in df.iterrows():
file.write(
"{measurement},type=options "
"ticker={UnderlyingSymbol}"
",symbol={OptionSymbol}"
",expDate={Expiration}"
",cont={Type}"
",price={UnderlyingPrice}"
",strike={Strike}"
",last={Last}"
",bid={Bid}"
",ask={Ask}"
",volume={Volume}"
",OI={OpenInterest}"
",IV={IV}"
",delta={Delta}"
",gamma={Gamma}"
",theta={Theta}"
",vega={Vega}"
",aka={AKA}"
" {DataDate}".format_map(row))
事实上,熊猫在这里并没有真正的优势。构建数据框占用了大量空间,但无论如何都需要通过 python 变量进行格式化,所以只需使用常规 CSV 模块
import csv
import shutil as s
from datetime import datetime as dt
dir = '/media/E/data/idb/'
f = '20020401.csv'
outFile = 'importbase.txt'
a = s.copyfile(dir + outFile, dir + 'import.txt')
with open(dir + f) as in_fp, open(a, "a") as out_fp:
reader = csv.DictReader(in_fp)
for row in reader:
# todo: figure out conversions outside pandas
#row["DataDate"] = ?
#row["Expiration"] = ?
row["measurement"] = "options"
file.write(
"{measurement},type=options "
"ticker={UnderlyingSymbol}"
",symbol={OptionSymbol}"
",expDate={Expiration}"
",cont={Type}"
",price={UnderlyingPrice}"
",strike={Strike}"
",last={Last}"
",bid={Bid}"
",ask={Ask}"
",volume={Volume}"
",OI={OpenInterest}"
",IV={IV}"
",delta={Delta}"
",gamma={Gamma}"
",theta={Theta}"
",vega={Vega}"
",aka={AKA}"
" {DataDate}".format_map(row))
推荐阅读
- javascript - 组件中映射的 vuex getter 未更新
- python - 如何将多个深度嵌套的 JSON 文件展平为 pandas 数据框?
- swift - 在两个单独的类中使用一个对象的最佳方法是什么?
- python - 通过单击获取 numpy 数组图像中某个点的索引
- javascript - 如何在特定时间后更新快速 js 路由器中的项目?
- php - 为什么这段代码使用 array_filter 和 array_map 来过滤整数数组?
- php - 将对象复制到数组将其转换为字符串
- php - 如何从关系 laravel 将列附加到模型
- sql - 如何计算一年中的滑动日计数SQL
- autodesk-forge - 在 Forge 中创建活动时出错 [严重]