DASK: Preventing NaN overwrites when using mask

Problem description

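In Pandas, I can use .loc to perform a calculation on the rows that meet a condition without affecting the other rows. That is, I can isolate row 1, change a column value, and know that row 2 stays unchanged.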
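For reference, the Pandas pattern described above might look like the following minimal sketch. It borrows the column names from the sample further down; the variable name cond is only illustrative and is not taken from the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 12, 15, 20, 0],
    'b': [1, 10, 15, 20, 10],
    'answer': ['Apple', 'Orange', 'Pear', 'Banana', 'Carrot'],
})
df['Coefficient'] = 'Dog'

# Illustrative condition: only rows where it is True are modified
cond = (df['a'] >= 12) & (df['b'] > 10)

# .loc assigns to the selected rows only; all other rows keep 'Dog'
df.loc[cond, 'Coefficient'] = df.loc[cond, 'answer']

print(df)
```

Here only rows 2 and 3 satisfy the condition, so the remaining rows still hold their original value afterwards.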

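In Dask, because of other conflicts, I can't use .loc with the functions I need (usually I get "function not implemented", since I have some complex formulas going on), so I turned to .mask() as a replacement for .loc().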

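Using .mask() causes the target column of the unselected rows to be overwritten with NaN (example below). That is, the rows that meet the condition are calculated correctly, but any existing values in the rows that do not meet it are replaced with NaN. Any further work on that column then leaves the previously calculated rows as NaN.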


What can I use or do to prevent existing values in the non-selected rows from being overwritten with NaN?

Let's start with a sample. 'Coefficient' is always set to 'Dog' at first.

import pandas as pd
import dask.dataframe as dd

data = {'a': [1, 12, 15, 20, 0],
        'b': [1, 10, 15, 20, 10],
        'answer': ['Apple', 'Orange', 'Pear', 'Banana', 'Carrot']}

df = pd.DataFrame(data, columns=['a', 'b', 'answer'])
ddf = dd.from_pandas(df, npartitions=1)

ddf['Coefficient'] = 'Dog'

#ddf['Coefficient'] = ddf['answer'].mask((ddf['a'] >= 12) & (ddf['b'] > 10))
#ddf['Coefficient'] = ddf['answer'].mask((ddf['a'] >= 12) & (ddf['b'] <= 10))
# Third conditional mask statement
# Fourth Conditional mask statement
# Nth Conditional mask statement


print(ddf.head())

Then, let's use .mask() to target rows and change them to a fruit/vegetable based on some criteria:

import pandas as pd
import dask.dataframe as dd

data = {'a': [1, 12, 15, 20, 0],
        'b': [1, 10, 15, 20, 10],
        'answer': ['Apple', 'Orange', 'Pear', 'Banana', 'Carrot']}

df = pd.DataFrame(data, columns=['a', 'b', 'answer'])
ddf = dd.from_pandas(df, npartitions=1)

ddf['Coefficient'] = 'Dog'

ddf['Coefficient'] = ddf['answer'].mask((ddf['a'] >= 12) & (ddf['b'] > 10))
ddf['Coefficient'] = ddf['answer'].mask((ddf['a'] >= 12) & (ddf['b'] <= 10))
# Third conditional
# Fourth Conditional
# Nth Conditional

print(ddf.head())

This results in:

    a   b  answer Coefficient
0   1   1   Apple       Apple
1  12  10  Orange         NaN
2  15  15    Pear        Pear
3  20  20  Banana      Banana
4   0  10  Carrot      Carrot

Swapping the two .mask() lines changes where the NaNs appear:

import pandas as pd
import dask.dataframe as dd

data = {'a': [1, 12, 15, 20, 0],
        'b': [1, 10, 15, 20, 10],
        'answer': ['Apple', 'Orange', 'Pear', 'Banana', 'Carrot']}

df = pd.DataFrame(data, columns=['a', 'b', 'answer'])
ddf = dd.from_pandas(df, npartitions=1)

ddf['Coefficient'] = 'Dog'

ddf['Coefficient'] = ddf['answer'].mask((ddf['a'] >= 12) & (ddf['b'] <= 10))
ddf['Coefficient'] = ddf['answer'].mask((ddf['a'] >= 12) & (ddf['b'] > 10))
# Third conditional
# Fourth Conditional
# Nth Conditional

print(ddf.head())

This results in:

    a   b  answer Coefficient
0   1   1   Apple       Apple
1  12  10  Orange      Orange
2  15  15    Pear         NaN
3  20  20  Banana         NaN
4   0  10  Carrot      Carrot

Tags: python, dataframe, dask

Solution


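If your mask/loc condition only uses information from within a single row (so nothing like .shift()), you can use ddf.map_partitions(my_func), where my_func is written in plain pandas syntax. Each partition is passed to my_func as a pandas DataFrame, so .loc assignment works there as usual: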

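def my_func(df):
    # Work on a copy so the incoming partition is not mutated in place
    df = df.copy()
    mask_1 = (df['a'] >= 12) & (df['b'] <= 10)
    mask_2 = (df['a'] >= 12) & (df['b'] > 10)
    # Plain pandas .loc assignment; rows outside the masks keep their values
    df.loc[mask_1, 'Coefficient'] = df.loc[mask_1, 'answer']
    df.loc[mask_2, 'Coefficient'] = df.loc[mask_2, 'answer']
    return df

ddf.map_partitions(my_func).compute()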
