Same Python code, same data, but different results depending on whether the data is imported?

Problem description

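I have a Python script that first aggregates and standardizes data into a file I call "tripFile". The code then tries to identify the differences between this latest tripFile and a previous tripFile.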

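If I export tripFile at the end of the first part of the code and then import it again for the second part, the run takes about 5 minutes and reports looping over 4,000 objects.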

newTripFile = pd.read_csv(PATH + today + ' Trip File v6.csv')

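However, if I do not export and re-import the data (just keeping it from the first part of the code), the run takes just under 24 hours (!!) and reports looping over 951,691 objects.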

newTripFile = tripFile

My data is a DataFrame, and I checked its shape: it is identical to that of the file I exported.

Any idea what is causing this?

Here is the second part of my code:

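import pandas as pd
from tqdm import tqdm

# PATH, OLDTRIPFILE and today are defined in the first part of the code
# Previous version of trip file
oldTripFile = pd.read_excel(PATH + OLDTRIPFILE)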
oldTripFile.drop(['id'], axis = 1, inplace = True)
oldTripFile['status'] = 'old'

# New version of trip file
newTripFile = pd.read_csv(PATH + today + ' Trip File v6.csv')
newTripFile.drop(['id'], axis = 1, inplace = True)
newTripFile['status'] = 'new'

db_trips = pd.concat([oldTripFile, newTripFile]) #concatenation of the two dataframes
db_trips = db_trips.reset_index(drop = True)
# Drop rows that are identical in every column except the last one ('status')
db_trips.drop_duplicates(keep = False, subset = list(db_trips.columns[:-1]), inplace = True)
db_trips = db_trips.reset_index(drop = True)
db_trips.head()
update_details = []

# Drop the duplicates: only ['fromCode', 'toCode', 'mode'] are considered when identifying duplicates
# The result is a dataframe that contains only the trips that were deleted or newly added
db_trips_delete_new = db_trips.drop_duplicates(keep = False, subset = ['fromCode', 'toCode', 'mode'])
db_trips_delete_new = db_trips_delete_new.reset_index(drop = True)

# New trips
new_trips = db_trips_delete_new[db_trips_delete_new['status'] == 'new'].values.tolist()
for trip in new_trips:
    trip.append('new trip added') 
update_details = update_details + new_trips


# Deleted trips
old_trips = db_trips_delete_new[db_trips_delete_new['status'] == 'old'].values.tolist()
for trip in old_trips:
    trip.append('trip deleted')
update_details = update_details + old_trips

db_trips_delete_new.head()

# Updated trips

# Ocean: no need to check the transit time column
sea_trips = db_trips.loc[db_trips['mode'].isin(['sea', 'cfs'])]
sea_trips = sea_trips.reset_index(drop = True)
list_trips_sea_update = sea_trips[sea_trips.duplicated(subset = ['fromCode', 'toCode', 'mode'], keep = False)].values.tolist()


if len(list_trips_sea_update) != 0:
    for i in tqdm(range(0, len(list_trips_sea_update) - 1)):
        for j in range(i + 1, len(list_trips_sea_update)):
            if list_trips_sea_update[i][2] == list_trips_sea_update[j][2] and list_trips_sea_update[i][9] == list_trips_sea_update[j][9] and list_trips_sea_update[i][14] == list_trips_sea_update[j][14]:
                update_comment = ''
                
                # Check display from / to
                if list_trips_sea_update[i][5] != list_trips_sea_update[j][5]:
                    update_comment = update_comment + 'fromDisplayLocation was updated.'
                if list_trips_sea_update[i][12] != list_trips_sea_update[j][12]:
                    update_comment = update_comment + 'toDisplayLocation was updated.'
                
                # Get the updated trip (the row with status new)
                if list_trips_sea_update[i][17] == 'new' and list_trips_sea_update[j][17] != 'new':
                    list_trips_sea_update[i].append(update_comment)
                    update_details = update_details + [list_trips_sea_update[i]]
                elif list_trips_sea_update[j][17] == 'new' and list_trips_sea_update[i][17] != 'new':
                    list_trips_sea_update[j].append(update_comment)
                    update_details = update_details + [list_trips_sea_update[j]]
                else:
                    print('excel files are not organized')

# Ground: the transit time column needs to be checked
ground_trips = db_trips[~db_trips['mode'].isin(['sea', 'cfs'])]
ground_trips = ground_trips.reset_index(drop = True)
list_trips_ground_update = ground_trips[ground_trips.duplicated(subset = ['fromCode', 'toCode', 'mode'], keep = False)].values.tolist()

if len(list_trips_ground_update) != 0:
    for i in tqdm(range(0, len(list_trips_ground_update) - 1)):
        for j in range(i + 1, len(list_trips_ground_update)):
            if list_trips_ground_update[i][2] == list_trips_ground_update[j][2] and list_trips_ground_update[i][9] == list_trips_ground_update[j][9] and list_trips_ground_update[i][14] == list_trips_ground_update[j][14]:
                update_comment = ''
                
                # Check display from / to
                if list_trips_ground_update[i][5] != list_trips_ground_update[j][5]:
                    update_comment = update_comment + 'fromDisplayLocation was updated.'
                if list_trips_ground_update[i][12] != list_trips_ground_update[j][12]:
                    update_comment = update_comment + 'toDisplayLocation was updated.'
                
                # Check transit time
                if list_trips_ground_update[i][15] != list_trips_ground_update[j][15]:
                    update_comment = update_comment + 'transit time was updated.'
                
                # Get the updated trip (the row with status new)
                if list_trips_ground_update[i][17] == 'new' and list_trips_ground_update[j][17] != 'new':
                    list_trips_ground_update[i].append(update_comment)
                    update_details = update_details + [list_trips_ground_update[i]]
                elif list_trips_ground_update[j][17] == 'new' and list_trips_ground_update[i][17] != 'new':
                    list_trips_ground_update[j].append(update_comment)
                    update_details = update_details + [list_trips_ground_update[j]]
                else:
                    print('excel files are not organized')

And here is an example of what my trip file looks like:

[Image: example of the trip file]

Any help is appreciated :)

Tags: python, pandas, dataframe

Solution


In case this is useful to someone else: the issue came from the column types. When I kept my tripFile in memory, one of my columns held "10.0", for example, whereas after importing the file the same column held "10". I compare against another imported tripFile, so when both files are imported the column has the same type on both sides. If one of the files is kept in memory, the column has a different type in each file, the values never match, and every row is considered updated. That is why the run takes so much longer when the data is kept in memory.
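
A minimal sketch of the effect described above, using made-up data. The column name `transitTime`, the sample values, and the helper names `dtype_mismatches` and `unmatched_rows` are assumptions for illustration and do not come from the original trip file. It shows two frames with the same shape but different column types passing through `drop_duplicates(keep=False)` with nothing removed, and the unchanged rows cancelling out again once the column is converted with `pd.to_numeric` on both sides.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
old = pd.DataFrame(
    {"fromCode": ["A", "B"], "toCode": ["X", "Y"], "transitTime": [10, 20]}
)
# Same trips, but here the column holds text such as "10.0" instead of numbers.
new = pd.DataFrame(
    {"fromCode": ["A", "B"], "toCode": ["X", "Y"], "transitTime": ["10.0", "20.0"]}
)


def dtype_mismatches(left, right):
    """Return {column: (left dtype, right dtype)} for shared columns whose dtypes differ."""
    shared = left.columns.intersection(right.columns)
    return {
        col: (str(left[col].dtype), str(right[col].dtype))
        for col in shared
        if str(left[col].dtype) != str(right[col].dtype)
    }


def unmatched_rows(left, right):
    """Concatenate both frames and drop every row that has an exact duplicate."""
    combined = pd.concat([left, right], ignore_index=True)
    return combined.drop_duplicates(keep=False)


# The shapes agree, so a shape check alone does not reveal the problem.
same_shape = old.shape == new.shape
mismatches = dtype_mismatches(old, new)

# With mismatched types, no row finds its twin and nothing is dropped.
before_fix = unmatched_rows(old, new)

# Align the column type on both sides before comparing.
old_fixed = old.assign(transitTime=pd.to_numeric(old["transitTime"]))
new_fixed = new.assign(transitTime=pd.to_numeric(new["transitTime"]))
after_fix = unmatched_rows(old_fixed, new_fixed)

print(same_shape, list(mismatches), len(before_fix), len(after_fix))
```

Comparing `dtypes` in addition to `shape` is a cheap check to run before the concatenation step, since every row that fails to match ends up in the nested comparison loops.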

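One plausible reason the export-and-reimport path behaved differently is that `pd.read_csv` infers column types afresh from the text in the file, so a round trip through CSV can change a column's dtype without changing the frame's shape. The sketch below uses hypothetical data and an in-memory buffer in place of a real file, and it is only an illustration of that inference, not a reconstruction of the original trip file.

```python
import io

import pandas as pd

# Hypothetical in-memory frame whose numeric-looking column is stored as text.
in_memory = pd.DataFrame({"mode": ["sea", "road"], "transitTime": ["10.0", "20.0"]})

# Write to an in-memory CSV buffer and read it back, mimicking export then import.
buffer = io.StringIO()
in_memory.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Same shape, but read_csv has re-inferred the column as numeric.
print(in_memory.shape == reloaded.shape)
print(in_memory["transitTime"].dtype, reloaded["transitTime"].dtype)
```

Passing an explicit `dtype` mapping to `pd.read_csv`, or converting the in-memory columns with `astype` before the comparison, are two ways to keep both sides consistent.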
