Optimizing a Python script that reads a CSV file into a dataframe

Problem Description

I wrote a Python script that reads a CSV file into a dataframe, sorts it by two columns (date and incident), and then generates two columns X and Y from the data. The data contains different incidents that must be processed independently.
For each incident ID, the X column is the list of incidents so far, so it grows from index = 0 up to index = len(listOfIncidents), while Y is simply listOfIncidents[index + 1].
Assume this is the data (irrelevant columns omitted for clarity), already sorted and ready to be split into X and Y:
[sample data image]
The image below shows the resulting X and Y:
[sample result image]
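For illustration, here is a minimal sketch of the intended expansion on a hypothetical activity sequence (the activity names below are made up, not taken from the real data):

activities = ["Open", "Assign", "Update", "Close"]   # hypothetical sequence for one incident
# X grows by one element per row, Y is the next activity (or "NULL" at the end):
#   X                                       Y
#   ["Open"]                                "Assign"
#   ["Open", "Assign"]                      "Update"
#   ["Open", "Assign", "Update"]            "Close"
#   ["Open", "Assign", "Update", "Close"]   "NULL"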

The CSV file is close to 500k rows, so first I convert the date column to a Unix timestamp (to make sorting easier) and then sort the dataframe. I created a function create_input_output that handles building X and Y and appending the result to maindfObj.

import csv
import pandas as pd
import time
import datetime


# Function takes the computed IncidentActivity_Type and IncidentID as arguments
# Function Writes the IncidentActivity_Type into X and Y and create a CSV file using the IncidentID as file name
# Define the column Names
_ncols = ('X', 'Y')
# Define empty dataframe
maindfObj = pd.DataFrame([], columns=_ncols)
def create_input_output(xy, incident_id):
    global maindfObj
    # Define Empty List
    values = []
    xList = []
    i = 0
    while i < len(xy):
        try:
            # Insert into X and Y so that X gets Index i and Y gets index i+1 etc
            xList = xy[0:i+1]
            values.append((xList, xy[i + 1]))
        except IndexError:
            xList = xy[0:i+1]
            # We have reached the end of list. Complete with NULL
            values.append((xList, "NULL"))
        i = i + 1
    # Write to dataframe
    subdfObj = pd.DataFrame(values, columns=_ncols)
    maindfObj = pd.concat([maindfObj, subdfObj], ignore_index=True)
    print("Appended "+ incident_id + ". New size: ") #print the size
    print(maindfObj.shape)
    # Create a filename
    #filename = incident_id + ".csv"
    # Write dataframe to file
    #subdfObj.to_csv(filename, sep=',')
    #print("Appended " + incident_id)

# Point to CSV file to be read. If file is in the same folder as python source just write filename
from google.colab import drive
drive.mount('/content/drive')
with open('/content/drive/My Drive/public/Incident_Activity.csv') as csv_file:
    # Read file. Possible delimiters are , ; and white space
    csv_reader = csv.reader(csv_file, delimiter=';')
    line_count = 0
    # Create a new list to hold the read data rows
    _newarr = []
    # Create Column
    _columns = []
    for row in csv_reader:
        # First Row is the header
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
            _columns = row[0].split(',')
            _columns = list(_columns)
            # Appending new column UnixEquiv to hold computed date
            _columns.append('UnixEquiv')
        # Other Rows are the data
        else:
            # Split
            rowstr = row[0].split(',')
            # Convert to list
            rowstr = list(rowstr)
            # Convert dateTime to Unix Timestamp and append to last column
            rowstr.append(time.mktime(datetime.datetime.strptime(rowstr[1], "%d/%m/%Y %H:%M").timetuple()))
            # Append new row to _newarr
            _newarr.append(tuple(rowstr))
            line_count += 1
    # Convert List to dataframe
    dfObj = pd.DataFrame(_newarr, columns=_columns)
    # Sort dataframe by the IncidentID and Unix Timestamp calculated earlier
    dfObj = dfObj.sort_values(by=['Incident ID', 'UnixEquiv'])

    # Create temporary XY List
    _tempxy = []
    # Initialize active Incident ID to empty
    _activeIncident = ""
    # Routine separates the big data based on Incident ID
    for index, row in dfObj.iterrows():
        if row['Incident ID'] == _activeIncident or _activeIncident == "":
            # Incident ID has not changed
            _tempxy.append(row['IncidentActivity_Type'])
            _activeIncident = row['Incident ID']
        else:
            # Incident ID has changed.
            # Write the finished batch (still labelled with the previous Incident ID) before starting the next one
            create_input_output(_tempxy, _activeIncident)
            # Clear the finished batch
            _tempxy.clear()
            # Start a new batch under the new Incident ID
            _activeIncident = row['Incident ID']
            _tempxy.append(row['IncidentActivity_Type'])
    # Flush the final batch, which the loop above never writes out
    if _tempxy:
        create_input_output(_tempxy, _activeIncident)
    # The combined dataframe is ready here; use it as needed
    print(maindfObj.shape) #print the size

The code works, but it takes far too long to finish. I know there are people here better than me who can help optimize this script.

Tags: python, pandas, csv

Solution
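One possible direction, sketched here under assumptions rather than as a definitive answer: let pandas.read_csv do the parsing and date conversion, sort once, and build the X/Y pairs per incident with groupby instead of iterrows plus repeated dataframe appends. The file path and the column names 'Incident ID' and 'IncidentActivity_Type' come from the question; the separator and the date column name 'DateStamp' are assumptions that may need adjusting.

import pandas as pd

# Read the whole file in one call; sep=',' and the 'DateStamp' column name
# are assumptions here, adjust them to the real file layout
df = pd.read_csv('/content/drive/My Drive/public/Incident_Activity.csv', sep=',')
df['DateStamp'] = pd.to_datetime(df['DateStamp'], format='%d/%m/%Y %H:%M')

# A single sort replaces the manual Unix-timestamp workaround
df = df.sort_values(['Incident ID', 'DateStamp'])

rows = []
# groupby preserves the sorted order within each incident, so each group's
# activity sequence can be expanded into (X, Y) pairs with plain Python lists
for incident_id, group in df.groupby('Incident ID', sort=False):
    acts = group['IncidentActivity_Type'].tolist()
    for i in range(len(acts)):
        rows.append((acts[:i + 1], acts[i + 1] if i + 1 < len(acts) else "NULL"))

# Build the final dataframe once instead of appending inside a loop
maindfObj = pd.DataFrame(rows, columns=['X', 'Y'])
print(maindfObj.shape)

The main cost in the original script is the per-row iterrows loop plus the per-incident dataframe append, which copies the accumulated frame every time it is called; collecting plain tuples and constructing the dataframe once at the end avoids that.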

