Dask - Finding Duplicate Values

Problem Description

I need to find duplicate values in a column of a dask DataFrame.

pandas has the duplicated() method for this, but dask does not support it.
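For reference, in plain pandas the whole check is a one-liner; passing keep=False marks every occurrence of a duplicated value, not just the repeats:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'])
# keep=False flags all occurrences of any value that appears more than once
mask = s.duplicated(keep=False)
print(s[mask].unique().tolist())  # ['a']
```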

Q: What is the best way to get all duplicated values in dask?

My idea: set the column I am checking as the index, drop_duplicates, then join back.

Is there a better solution?

For example:

import dask.dataframe
import pandas

df = pandas.DataFrame(
    [
        ['a'],
        ['b'],
        ['c'],
        ['a']
    ],
    columns=['col']
)
df_test = dask.dataframe.from_pandas(df, npartitions=2)
# Expected to get dataframe with value 'a', as it appears twice

Tags: python, pandas, dask

Solution


I came up with the following solution:

import dask.dataframe as dd
import pandas

if __name__ == '__main__':
    df = pandas.DataFrame(
        [
            ['a'],
            ['b'],
            ['c'],
            ['a']
        ],
        columns=["col-a"]
    )
    ddf = dd.from_pandas(df, npartitions=2)

    # Apparently the code below will fail if the dask DataFrame is empty
    if ddf.index.size.compute() != 0:
        # After set_index the data is repartitioned, so all occurrences of a
        # value end up in the same partition
        indexed_df = ddf.set_index('col-a', drop=False)
        # Mark duplicated values within each partition; dask DataFrame does
        # not support duplicated() directly.
        dups = indexed_df.map_partitions(lambda d: d.duplicated())
        # Select the rows flagged as duplicates and collect their index values.
        duplicates = indexed_df[dups].compute().index.tolist()
        print(duplicates)  # Prints: ['a']

Can this be improved further?
