Quickest way to assign cell values in Pandas

Problem description

I have a list of tuples:

d = [("a", "x"), ("b", "y"), ("a", "y")]

and the DataFrame:

     y    x
b  0.0  0.0
a  0.0  0.0

I would like to replace any 0s with 1s if the row and column labels correspond to a tuple in d, such that the new DataFrame is:

     y    x
b  1.0  0.0
a  1.0  1.0

I am currently using:

for i, j in d:
    df.loc[i, j] = 1.0

This seems to me the most "pythonic" approach, but for a DataFrame of shape 20000 x 20000 and a list of length 10000, it takes far too long. There must be a better way of accomplishing this. Any ideas?

Thanks

Tags: python, pandas, numpy

Solution


Approach #1: No bad entries in d

Here's one NumPy based method -

import numpy as np
import pandas as pd

def assign_val(df, d, newval=1):
    # Get d-rows,cols as arrays for efficient usage later on
    di,dc = np.array([j[0]  for j in d]), np.array([j[1]  for j in d])

    # Get col and index data
    i,c = df.index.values.astype(di.dtype),df.columns.values.astype(dc.dtype)

    # Locate row indexes from d back to df
    sidx_i = i.argsort()
    I = sidx_i[np.searchsorted(i,di,sorter=sidx_i)]

    # Locate column indexes from d back to df
    sidx_c = c.argsort()
    C = sidx_c[np.searchsorted(c,dc,sorter=sidx_c)]

    # Assign into array data with new values
    df.values[I,C] = newval
    # Use df.to_numpy(copy=False)[I,C] = newval on newer pandas versions
    return df

Sample run -

In [21]: df = pd.DataFrame(np.zeros((2,2)), columns=['y','x'], index=['b','a'])

In [22]: d = [("a", "x"), ("b", "y"), ('a','y')]

In [23]: assign_val(df, d, newval=1)
Out[23]: 
     y    x
b  1.0  0.0
a  1.0  1.0
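
As a point of comparison, the same label-to-position lookup can be sketched with pandas' own `Index.get_indexer` instead of `argsort`/`searchsorted`. This is not the method from the answer above, only an alternative under the same assumptions: every label in d exists in the DataFrame, and the index and columns are unique. The function name `assign_val_indexer` is made up for this illustration, and it returns a new DataFrame rather than writing into `df.values`.

```python
import numpy as np
import pandas as pd

def assign_val_indexer(df, d, newval=1.0):
    # Hypothetical helper, not part of the original answer.
    # Split the (row_label, col_label) pairs into two parallel lists.
    rows = [r for r, _ in d]
    cols = [c for _, c in d]

    # Map labels to integer positions; a missing label would give -1,
    # so this sketch assumes there are no bad entries in d.
    I = df.index.get_indexer(rows)
    C = df.columns.get_indexer(cols)

    # Assign into a copy of the underlying array and rebuild the frame,
    # which avoids relying on df.values being writable.
    arr = df.to_numpy(copy=True)
    arr[I, C] = newval
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

df = pd.DataFrame(np.zeros((2, 2)), columns=["y", "x"], index=["b", "a"])
d = [("a", "x"), ("b", "y"), ("a", "y")]
out = assign_val_indexer(df, d)
```

Either way, the point is the same as in Approach #1: resolve all labels to integer positions once, then do a single vectorized assignment instead of 10000 separate `df.loc` calls.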

Approach #2: Generic one

If there are any bad entries in d (labels that do not exist in the DataFrame), we need to filter those out. So, a modified version for that generic case would be -

def ssidx(i,di):
    sidx_i = i.argsort()
    idx_i = np.searchsorted(i,di,sorter=sidx_i)
    invalid_mask = idx_i==len(sidx_i)
    idx_i[invalid_mask] = 0
    I = sidx_i[idx_i]    
    invalid_mask |= i[I]!=di
    return I,invalid_mask

# Get d-rows,cols as arrays for efficient usage later on
di,dc = np.array([j[0]  for j in d]), np.array([j[1]  for j in d])

# Get col and index data
i,c = df.index.values.astype(di.dtype),df.columns.values.astype(dc.dtype)

# Locate row indexes from d back to df
I,badmask_I = ssidx(i,di)

# Locate column indexes from d back to df
C,badmask_C = ssidx(c,dc)

badmask = badmask_I | badmask_C

goodmask = ~badmask

newval = 1  # value to assign, as in Approach #1
df.values[I[goodmask],C[goodmask]] = newval
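
To see the filtering in action, the steps above can be wrapped in a function and run on a list that contains bad entries. This is a minimal sketch, assuming string labels: the name `assign_val_generic` is made up here, the labels are cast with `astype(str)` rather than `astype(di.dtype)`, and the result is written into a copy of the data rather than into `df.values`.

```python
import numpy as np
import pandas as pd

def ssidx(i, di):
    # Same helper as above: positions of di in i, plus a mask of labels not found.
    sidx_i = i.argsort()
    idx_i = np.searchsorted(i, di, sorter=sidx_i)
    invalid_mask = idx_i == len(sidx_i)
    idx_i[invalid_mask] = 0
    I = sidx_i[idx_i]
    invalid_mask |= i[I] != di
    return I, invalid_mask

def assign_val_generic(df, d, newval=1.0):
    # Hypothetical wrapper around the Approach #2 steps; assumes string labels.
    di = np.array([j[0] for j in d])
    dc = np.array([j[1] for j in d])
    i = df.index.values.astype(str)
    c = df.columns.values.astype(str)

    I, badmask_I = ssidx(i, di)
    C, badmask_C = ssidx(c, dc)
    goodmask = ~(badmask_I | badmask_C)

    # Assign only the valid (row, col) pairs into a copy, then rebuild the frame.
    arr = df.to_numpy(copy=True)
    arr[I[goodmask], C[goodmask]] = newval
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

df = pd.DataFrame(np.zeros((2, 2)), columns=["y", "x"], index=["b", "a"])
# ("z", "x") has an unknown row label and ("a", "q") an unknown column label.
d = [("a", "x"), ("b", "y"), ("z", "x"), ("a", "q")]
out = assign_val_generic(df, d)
```

Only the cells for ("a", "x") and ("b", "y") are set; the two entries with unknown labels are skipped instead of raising an error or overwriting the wrong cell.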
