python - Optimizing a Python script that reads a CSV file into a dataframe
Problem description
I wrote a Python script that reads a CSV file into a dataframe, sorts it by two columns (date and incident), and then generates two columns X|Y from the data. The data contains distinct incidents that must be processed independently.
For each incident ID, the X column is a list of activities, so it grows from index = 0 up to index = len(listOfIncidents), while Y is simply listOfIncidents(index + 1).
Assume this is the data, already sorted and ready to be split into X and Y (irrelevant columns omitted for clarity):
[example data image]
The image below shows the resulting X and Y:
[example result image here]
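Since the example images do not reproduce here, the X/Y expansion for a single incident can be illustrated directly (the activity names below are hypothetical):

```python
# For one incident's time-ordered activity list, X is the growing prefix
# and Y is the next activity, or "NULL" once the list is exhausted.
activities = ["Open", "Assign", "Close"]  # hypothetical activity types
pairs = [
    (activities[: i + 1],
     activities[i + 1] if i + 1 < len(activities) else "NULL")
    for i in range(len(activities))
]
for x, y in pairs:
    print(x, "->", y)
# ['Open'] -> Assign
# ['Open', 'Assign'] -> Close
# ['Open', 'Assign', 'Close'] -> NULL
```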
The CSV file is close to 500k rows, so first I convert the date column to a Unix timestamp (for easier sorting) and then sort the dataframe. I created a function create_input_output that handles building X and Y and appending the result to maindfObj.
import csv
import pandas as pd
import time
import datetime

# Function takes the computed IncidentActivity_Type list and the IncidentID as arguments
# Function writes the IncidentActivity_Type values into X and Y and can create a CSV file using the IncidentID as file name

# Define the column names
_ncols = ('X', 'Y')
# Define empty dataframe
maindfObj = pd.DataFrame([], columns=_ncols)

def create_input_output(xy, incident_id):
    global maindfObj
    # Define empty lists
    values = []
    xList = []
    i = 0
    while i < len(xy):
        try:
            # Insert into X and Y so that X gets indices 0..i and Y gets index i+1
            xList = xy[0:i + 1]
            values.append((xList, xy[i + 1]))
        except IndexError:
            xList = xy[0:i + 1]
            # We have reached the end of the list. Complete with NULL
            values.append((xList, "NULL"))
        i = i + 1
    # Write to dataframe (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    subdfObj = pd.DataFrame(values, columns=_ncols)
    maindfObj = pd.concat([maindfObj, subdfObj])
    print("Appended " + incident_id + ". New size:")  # print the size
    print(maindfObj.shape)
    # Create a filename
    #filename = incident_id + ".csv"
    # Write dataframe to file
    #subdfObj.to_csv(filename, sep=',')
    #print("Appended " + incident_id)
# Point to the CSV file to be read. If the file is in the same folder as the Python source, just use the filename
from google.colab import drive
drive.mount('/content/drive')

with open('/content/drive/My Drive/public/Incident_Activity.csv') as csv_file:
    # Read file. Possible delimiters are , ; and whitespace
    csv_reader = csv.reader(csv_file, delimiter=';')
    line_count = 0
    # Create a new list to hold the data rows
    _newarr = []
    # Create the column list
    _columns = []
    for row in csv_reader:
        # First row is the header
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
            _columns = list(row[0].split(','))
            # Append a new column UnixEquiv to hold the computed date
            _columns.append('UnixEquiv')
        # Other rows are the data
        else:
            # Split the comma-separated row into a list
            rowstr = list(row[0].split(','))
            # Convert the datetime to a Unix timestamp and append it as the last column
            rowstr.append(time.mktime(datetime.datetime.strptime(rowstr[1], "%d/%m/%Y %H:%M").timetuple()))
            # Append the new row to _newarr
            _newarr.append(tuple(rowstr))
            line_count += 1

# Convert the list to a dataframe
dfObj = pd.DataFrame(_newarr, columns=_columns)
# Sort the dataframe by Incident ID and the Unix timestamp calculated earlier
dfObj = dfObj.sort_values(by=['Incident ID', 'UnixEquiv'])

# Create a temporary XY list
_tempxy = []
# Initialize the active Incident ID to empty
_activeIncident = ""
# Routine separates the big dataframe by Incident ID
for index, row in dfObj.iterrows():
    if row['Incident ID'] == _activeIncident or _activeIncident == "":
        # Incident ID has not changed
        _tempxy.append(row['IncidentActivity_Type'])
        _activeIncident = row['Incident ID']
    else:
        # Incident ID has changed. Write the finished batch before starting
        # the next one (passing the old ID, not the new one)
        create_input_output(_tempxy, _activeIncident)
        _activeIncident = row['Incident ID']
        # Clear the old batch
        _tempxy.clear()
        # Start the new batch
        _tempxy.append(row['IncidentActivity_Type'])

# Flush the final batch, which the loop above never writes
if _tempxy:
    create_input_output(_tempxy, _activeIncident)

# Dataframe is ready here. Use
print(maindfObj.shape)  # print the size
The code works, but it takes far too long to finish. I'm sure there are people here who can help me optimize the script.
Solution
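One way to attack the slowness, sketched under the assumption that the CSV has an `Incident ID` column, a `%d/%m/%Y %H:%M` date column, and an `IncidentActivity_Type` column as in the question: let pandas do the work. Vectorized `pd.to_datetime` replaces the per-row `strptime`/`mktime` loop (a datetime column sorts correctly, so no Unix conversion is needed), `groupby` replaces the `iterrows()` change-detection, and one final `DataFrame` construction replaces the repeated per-incident appends. The inline dataframe below is a synthetic stand-in for `pd.read_csv('Incident_Activity.csv')`:

```python
import pandas as pd

# Synthetic stand-in for pd.read_csv("Incident_Activity.csv") with the
# question's (assumed) column names.
df = pd.DataFrame({
    "Incident ID": ["IM1", "IM1", "IM1", "IM2", "IM2"],
    "DateStamp": ["01/01/2016 10:00", "01/01/2016 11:00", "01/01/2016 09:00",
                  "02/01/2016 08:00", "02/01/2016 09:00"],
    "IncidentActivity_Type": ["Open", "Assign", "Reassign", "Open", "Close"],
})

# Vectorized date parsing replaces the per-row strptime/mktime loop.
df["DateStamp"] = pd.to_datetime(df["DateStamp"], format="%d/%m/%Y %H:%M")
df = df.sort_values(["Incident ID", "DateStamp"])

# groupby replaces the manual iterrows() batching; all X/Y rows are
# collected in one list and turned into a dataframe once at the end.
rows = []
for incident_id, grp in df.groupby("Incident ID", sort=False):
    acts = grp["IncidentActivity_Type"].tolist()
    for i in range(len(acts)):
        rows.append((acts[: i + 1],
                     acts[i + 1] if i + 1 < len(acts) else "NULL"))

maindfObj = pd.DataFrame(rows, columns=("X", "Y"))
print(maindfObj.shape)
```

With ~500k rows, much of the original cost is the 500k `iterrows()` iterations plus one dataframe append per incident, each of which copies the whole accumulated frame; building `rows` first and constructing the dataframe once avoids that quadratic copying.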