Circular or hierarchical dictionary-like data structures in Python (or CS in general)?

Problem Description

I'm currently at a loss: I have a huge Pandas DataFrame (over 1 million rows) and I'm looking at 3 columns - namely the following:

Company_Name     Business_ID    Location
ABC, Inc.         BY2389AS        MTV
ABC, Inc.          100020         LMD
XYZW               010012         MTV
XYZW               010012         LMD
XYZW              AB23JKF         QAT
                  BA23F3              
SomethingCo        2342
SomethingCo                       ALD

As can be seen, some fields are sometimes missing. I want to check these rows against a given registry (a CSV containing millions of (Company_Name, Business_ID, Location) triples) and, if a unique match exists, fill in the missing fields.

The registry looks like this in CSV format:

Company_Name, Business_ID, Location
ABC, Inc., BY2389AS, MTV
ABC, Inc., 100020, LMD
XYZW, 010012, MTV
XYZW, 010012, LMD
XYZW, AB23JKF, QAT
DLCComp, BA23F3, PLT
DLCComp, 234XYZ, QAT            
SomethingCo, 2342, COD
SomethingCo, 2020 , ALD

As shown above, this CSV file has no missing values.

It's worth noting that doing a DataFrame groupby, pivot table, stack/unstack, or even logical lookups to select subsets of the DataFrame slows things down (since scanning the entire registry takes a long time). The logic I want to follow is: if some fields are missing, look in the registry for a unique match and fill in the missing fields; otherwise, if no unique match can be identified, return the row as-is.

A dictionary lookup seems ideal - but since any combination of the 3 fields can be missing, I can't simply build a dictionary from this huge registry DataFrame (which I read into memory for current purposes) keyed on just one of the columns.

I'm trying to figure out which data structure would handle this best, and have been considering a circular or hierarchical dictionary (if such things exist), but that also gets complicated because the lookup has to adapt to different key-value pairs depending on which values are missing from the data relative to the registry. Rearranging or subsetting the DataFrame based on the available column values takes a while each time - I'm trying to find out whether there is a good data structure or algorithm for this. I'm also trying to see whether Pandas has a really efficient way to subset a DataFrame via logical operations (which is what I'm doing now):

# Example for if Business_ID is populated, but both Company_Name and Location are not:
def specific_case_func_for_demo_purposes(company_name, business_id, location):
    if not company_name and business_id and not location:
        subset_df = registry_df[registry_df['Business_ID'] == business_id]

        if len(subset_df) == 0:
            return company_name, business_id, location
        elif len(subset_df) == 1:
            return subset_df['Company_Name'].iloc[0], business_id, subset_df['Location'].iloc[0]
        else:
            # handle case when there are multiple business_id matches by seeing if
            # company name is unique, since company name can be identified by business ID:
            if len(subset_df['Company_Name'].unique()) == 1:
                return subset_df['Company_Name'].iloc[0], business_id, location
            else:
                # can't possibly know if there is a unique match - so just return
                # the empty company_name and location
                return company_name, business_id, location

This is just one function handling the particular case where Business_ID is populated but Company_Name and Location are not. As you can see, this gets convoluted. I'm currently handling all 8 cases (some reduce to duplicate or essentially identical cases, so really about 4 cases plus a few sub-cases), but this seems very inefficient in both design and performance. Using a subset of the registry CSV with 800,000 rows and running this logic over roughly 400 data points takes 35 seconds, with a standard deviation of 128 ms per %timeit. I timed the main function I designed via df.apply.
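As an aside, the eight cases can be collapsed into one generic function by building a boolean mask from whichever fields happen to be populated. A minimal sketch, using a hypothetical mini-registry and assuming the simple policy "fill blanks only on a single exact match" (the column names come from the question; everything else here is illustrative):

```python
import pandas as pd

# Hypothetical mini-registry for illustration only
registry_df = pd.DataFrame(
    [("ABC, Inc.", "BY2389AS", "MTV"),
     ("ABC, Inc.", "100020", "LMD"),
     ("XYZW", "010012", "MTV"),
     ("XYZW", "010012", "LMD")],
    columns=["Company_Name", "Business_ID", "Location"],
)

def fill_missing(company_name, business_id, location):
    """Filter on whichever fields are present; fill blanks on a unique match."""
    given = {"Company_Name": company_name,
             "Business_ID": business_id,
             "Location": location}
    mask = pd.Series(True, index=registry_df.index)
    for col, val in given.items():
        if val:  # only filter on populated fields
            mask &= registry_df[col] == val
    subset = registry_df[mask]
    if len(subset) == 1:  # unique match: take missing values from the registry row
        row = subset.iloc[0]
        return tuple(given[c] or row[c] for c in given)
    return company_name, business_id, location  # no unique match: return as-is
```

This replaces the 8-way case split with one code path, though it still does a linear scan per lookup; the indexing approaches below address the scan itself.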

I'd like to know of a data structure that offers good lookup and set times when applied to the larger DataFrame - I know there must be a more efficient way, whether through effective use of Pandas methods or a different data structure such as mutually-referencing lookup dictionaries (or circular or hierarchical dictionaries, if those concepts exist), potentially a tree-based approach, or something else. I welcome any input on this.

Tags: python, pandas, dataframe, dictionary, data-structures

Solutions


Solution 1 - Optimize the DataFrame search

import sys
import time
import timeit
import random
import pandas as pd
# Creating dummy data for 1 million rows
number_of_rows = 1000000 # 1 million
list_Temp = [('abcdefghi_'+str(i), 'jklmnopqrs_'+str(i), 'tuvwxyz_'+str(i)) for i in range(number_of_rows)]

# This is your registry_df
df = pd.DataFrame(list_Temp, columns = ['A' , 'B', 'C']) 
print('Dummy registry_df size =', (sys.getsizeof(df)/1000000), 'MB')
df.head()

Output:

Dummy registry_df size = 217.666774 MB

        A               B             C
0   abcdefghi_0   jklmnopqrs_0    tuvwxyz_0
1   abcdefghi_1   jklmnopqrs_1    tuvwxyz_1
2   abcdefghi_2   jklmnopqrs_2    tuvwxyz_2
3   abcdefghi_3   jklmnopqrs_3    tuvwxyz_3
4   abcdefghi_4   jklmnopqrs_4    tuvwxyz_4


# Creating 'n' random numbers for searching in registry_df
n = 100
list_random_numbers = [random.randrange(0, number_of_rows) for i in range(n)]
# Time taken for searching for n values in registry_df. Pick the best of the options below:

%timeit for number in list_random_numbers: df_Temp = df.loc[df['A'] == 'abcdefghi_'+str(number)]
%timeit for number in list_random_numbers: df_Temp = df[df['A'] == 'abcdefghi_'+str(number)]
%timeit for number in list_random_numbers: df_Temp = df[df.A.eq('abcdefghi_'+str(number))]
%timeit for number in list_random_numbers: df_Temp = df.query("A == 'abcdefghi_" + str(number) + "'")
%timeit for number in list_random_numbers: df_Temp = df[df.eval("A == 'abcdefghi_" + str(number) + "'")]
%timeit for number in list_random_numbers: df_Temp = df[df.A.isin(['abcdefghi_'+str(number)])]

Output:

11.9 s ± 338 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.4 s ± 441 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
13 s ± 756 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.57 s ± 384 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.22 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.99 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
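All of the options above still scan every row per lookup. As a further comparison (not part of the original answer), building a hash-based index on the lookup column once via `set_index` turns each search into a dictionary-style `.loc` lookup - a sketch on a smaller dummy frame with the same column names:

```python
import pandas as pd

# Same shape of dummy data as above, just smaller
df = pd.DataFrame(
    [('abcdefghi_'+str(i), 'jklmnopqrs_'+str(i), 'tuvwxyz_'+str(i))
     for i in range(1000)],
    columns=['A', 'B', 'C'])

# Build the index once; subsequent .loc lookups use a hash table
# instead of scanning every row
indexed = df.set_index('A', drop=False)
row = indexed.loc['abcdefghi_42']  # a Series, since values in 'A' are unique
```

The one-time `set_index` cost is amortized across all subsequent lookups, which is the same trade-off Solution 2 makes with explicit dictionaries.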

Solution 2 - If you have a little extra memory: store row indexes in per-column dictionaries

# Searching dataframe using indexes
# For demo Creating list of 'n' random numbers for searching in registry_df based on indexes

n = 100
list_random_numbers= [random.randrange(0, number_of_rows) for i in range(n)]

# Fetching rows based on index

%timeit for number in list_random_numbers: row = df[df.index.isin([number])]
%timeit for number in list_random_numbers: row = df.iloc[[number]]

Output:

948 ms ± 64.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
27.7 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Create dictionaries containing the indexes

dict_Company_Name ={}
dict_Business_ID = {}
dict_Location = {}

def Create_Dictionary_From_DataFrame(df, dict_Company_Name, dict_Business_ID, dict_Location):
    for row_Index in range(len(df)):
        company_name, business_id, location = df.iloc[row_Index]
        dict_Company_Name.setdefault(company_name, []).append(row_Index)
        dict_Business_ID.setdefault(business_id, []).append(row_Index)
        dict_Location.setdefault(location, []).append(row_Index)

%time Create_Dictionary_From_DataFrame(df, dict_Company_Name, dict_Business_ID, dict_Location)

Wall time: 2min 12s
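As a hedged aside, the 2-minute Python-level loop above can likely be replaced by `groupby(...).indices`, which builds the same value-to-row-positions mapping in one vectorized pass. A sketch on tiny dummy data (duplicated values are used so the grouping is visible; with all-unique values every list would have length 1):

```python
import pandas as pd

df = pd.DataFrame(
    [('ABC', 'b'+str(i), 'MTV' if i % 2 == 0 else 'LMD') for i in range(4)],
    columns=['A', 'B', 'C'])

# Maps each column value to an array of positional row indexes,
# equivalent to the setdefault/append loop but computed internally by pandas
dict_A = df.groupby('A').indices
dict_C = df.groupby('C').indices
```

The resulting arrays can be passed straight to `df.iloc[...]` just like the lists built by `Create_Dictionary_From_DataFrame`.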

# Size in MB. For you this will vary as I have created dummy data
print((sys.getsizeof(dict_Company_Name))/1000000)
print((sys.getsizeof(dict_Business_ID))/1000000)
print((sys.getsizeof(dict_Location))/1000000)

41.943144
41.943144
41.943144
# Searching based on indexes

n = 10000
list_random_numbers = [random.randrange(0, number_of_rows) for i in range(n)]

def specific_case_func_business_id():

    for row_Index in list_random_numbers:
        business_id = 'jklmnopqrs_' + str(row_Index)

        if business_id in dict_Business_ID:
            list_Indexes_Business_ID = dict_Business_ID[business_id]

            if len(list_Indexes_Business_ID) > 0:
                subset_df = df.iloc[list_Indexes_Business_ID]           
                """
                Your case specific code
                """
            else:
                raise Exception('No matching data found for Business ID = ' + business_id )

%timeit specific_case_func_business_id()

Output:

2.99 s ± 278 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As seen when searching the DataFrame directly, the most optimized option took about 4 seconds to search for just 100 values, whereas searching for all 10,000 values via the indexes took about 3 seconds.

Also, if you have 2 known values and 1 missing value to look up, fetch the index lists from the corresponding dictionaries, use the following function to get the common indexes, and finally build subset_df from the common index list:

def Get_Common_List_Values(list1, list2):
    if (list1 is None) and (list2 is not None): return list2
    if (list2 is None) and (list1 is not None): return list1

    if (list2 is not None) and (list1 is not None):
        return [row_Index for row_Index in list1 if row_Index in list2]
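When the index lists get long, the list-comprehension intersection above becomes quadratic, since each `in list2` check scans the whole list. A set-based variant of the same idea (keeping the same `None` handling) stays linear:

```python
def get_common_list_values_fast(list1, list2):
    # Same semantics as Get_Common_List_Values, but O(len1 + len2)
    if list1 is None:
        return list2
    if list2 is None:
        return list1
    lookup = set(list2)                       # hash set for O(1) membership tests
    return [i for i in list1 if i in lookup]  # preserves list1's order
```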

Note: all timings were measured on the dummy data. Your actual times will vary.

