Difference between masking and querying a pandas.DataFrame

Problem description

My example shows that, on a DataFrame of floats, querying can be faster than masking in some cases.

Edit (thanks to a_guest): looking at the graph, the mask performs better when the condition is composed of 1 to 5 subconditions.

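So, is there any real difference between the two methods, given that both tend to follow the same trend as the number of subconditions grows?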

The function used to plot my data:

import matplotlib.pyplot as plt

def graph(data):
    # x axis: number of conditions, from 1 to the number of measurements
    t = list(range(1, len(data["mask"]) + 1))

    plt.xlabel('Number of conditions')
    plt.ylabel('timeit (ms)')
    plt.title('Benchmark mask vs query')
    plt.grid(True)
    plt.plot(t, data["mask"], 'r', label="mask")
    plt.plot(t, data["query"], 'b', label="query")
    plt.xlim(1, len(data["mask"]))
    plt.legend()
    plt.show()

The functions used to build the conditions to be timed with timeit:

def create_multiple_conditions_mask(columns, nb_conditions, condition):
    mask_list = []
    for i in range(nb_conditions):
        mask_list.append("(df['" + columns[i] + "']" + " " + condition + ")")
    return " & ".join(mask_list)

def create_multiple_conditions_query(columns, nb_conditions, condition):
    mask_list = []
    for i in range(nb_conditions):
        mask_list.append(columns[i] + " " + condition)
    return "'" + " and ".join(mask_list) + "'"
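For two columns and the condition `> 0`, the helpers above produce the strings `(df['A'] > 0) & (df['B'] > 0)` and `'A > 0 and B > 0'`. The following is a minimal sketch, not part of the original benchmark, that checks both forms select the same rows; the frame size, column names, and seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; shape, column names and seed are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 3)), columns=list("ABC"))

# Boolean-mask form: each comparison yields a boolean Series, combined with &.
masked = df[(df["A"] > 0) & (df["B"] > 0)]

# query form: the same condition written as an expression string.
queried = df.query("A > 0 and B > 0")

# Both approaches should return the same rows in the same order.
print(masked.equals(queried))
```

So the two approaches differ in how the condition is expressed and evaluated, not in what they return.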

The function that benchmarks masking vs. querying on a pandas DataFrame containing floats:

import numpy as np
import pandas as pd
from timeit import timeit

def benchmarks_mask_vs_query(dim_df=(50,10), labels=[], condition="> 0", random=False):
    # init local variables
    time_results = {"mask": [], "query": []}
    nb_samples, nb_columns = dim_df
    all_labels = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')

    if nb_columns > 26:
        if len(labels) == nb_columns:
            all_labels = labels
        else:
            raise ValueError("labels length must match nb_columns")

    df = pd.DataFrame(np.random.randn(nb_samples, nb_columns), columns=all_labels[:nb_columns])

    for col in range(nb_columns):
        if random:
            condition = "<" + str(np.random.random(1)[0])
        mask = "df[" + create_multiple_conditions_mask(df.columns, col+1, condition) + "]"
        query = "df.query(" + create_multiple_conditions_query(df.columns, col+1, condition) + ")"
        print("Parameters: nb_conditions=" + str(col+1) + ", condition= " + condition)
        print("Mask created: " + mask)
        print("Query created: " + query)
        print()
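        # timeit returns the total time in seconds for 100 runs;
        # multiplying by 10 converts it to milliseconds per run
        result_mask = timeit(mask, number=100, globals=locals()) * 10
        result_query = timeit(query, number=100, globals=locals()) * 10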
        time_results["mask"].append(result_mask)
        time_results["query"].append(result_query)
    return time_results
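The benchmark above times statement strings and converts the total for 100 runs into milliseconds per run by multiplying by 10. As a possible variation, and only as a sketch, the same measurement can be taken on callables with `timeit.repeat`, keeping the best of several rounds to reduce noise. The helper name `time_ms`, the frame shape, and the `number`/`repeat` values are all assumptions made for this example.

```python
import timeit

import numpy as np
import pandas as pd


def time_ms(func, number=20, repeat=3):
    """Best per-call time in milliseconds over `repeat` rounds of `number` calls."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    # Each total is in seconds for `number` calls; convert to ms per call.
    return min(totals) / number * 1000


# Hypothetical small frame, similar in size to the one in the question.
df = pd.DataFrame(np.random.randn(50, 3), columns=list("ABC"))

mask_ms = time_ms(lambda: df[(df["A"] > 0) & (df["B"] > 0)])
query_ms = time_ms(lambda: df.query("A > 0 and B > 0"))

print(f"mask:  {mask_ms:.3f} ms per call")
print(f"query: {query_ms:.3f} ms per call")
```

Passing lambdas avoids building code strings and the `globals=locals()` argument. The measured values depend on the machine and the pandas version, so no expected output is shown.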

What I ran:

# benchmark on a DataFrame of shape (50, 25) populated with random values,
# with random conditions as well ("<random_value")
data = benchmarks_mask_vs_query((50,25), random=True)
graph(data)

What I get:

[Figure: "Benchmark mask vs query", timeit (ms) against number of conditions, mask in red and query in blue]

Tags: python, pandas, dataframe, matplotlib

Solution

