pandas: assign value based on equivalent value in another row with lookup

Problem description

I have a dataframe with this structure:

import pandas as pd

df = pd.DataFrame([
    { "state": "CA", "region": "West", "total": 2, "week": 10 },
    { "state": "UT", "region": "Midwest", "total": 7, "week": 10 },
    { "state": "CA", "region": "CA", "total": 14, "week": 11 },
    { "state": "UT", "region": "UT", "total": 18, "week": 11 },
    { "state": "CA", "region": "West", "total": 21, "week": 12 },
    { "state": "UT", "region": "Midwest", "total": 30, "week": 12 }
])

(My real dataset has all the states in!)

All the region values for rows with week set to 11 have been set wrongly, so I would like to replace them with the value for the same state in week 10.

I tried this:

df[df.week == 11].region = df[df.week == 10].region

But it's reliant on the rows being in the same order, which they might not be. Is there a way I can do this while matching on state name?

Tags: python, pandas

Solution


The code to create your dataframe was malformed as posted, so I couldn't test the following, but you can:

  1. merge the week 10 rows, relabelled as week 11, back onto the dataframe, matching on state and week (row order does not matter, since a merge matches on keys)
  2. use np.where() to overwrite region with the new region_x column wherever week is 11
  3. then, you can just drop region_x

import numpy as np

# week 10 rows with week changed to 11, so they line up with the rows to fix
df = df.merge(df.loc[df['week'] == 10, ['state','region','week']].replace(10,11),
             how='left',
             on=['state', 'week'], suffixes=('', '_x'))
# only week 11 rows take the looked-up value; all other rows keep their region
df['region'] = np.where(df['week'] == 11, df['region_x'], df['region'])
df = df.drop('region_x', axis=1)
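
To check that the merge steps above do not depend on row order, here is a minimal self-contained sketch. It uses the sample data from the question, deliberately shuffled; the names lookup and fixed are illustrative and not part of the original answer, and .assign(week=11) is used in place of .replace(10, 11) as an equivalent way to relabel the week 10 rows.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, shuffled so that row order cannot help.
df = pd.DataFrame([
    {"state": "UT", "region": "UT", "total": 18, "week": 11},
    {"state": "CA", "region": "West", "total": 2, "week": 10},
    {"state": "CA", "region": "CA", "total": 14, "week": 11},
    {"state": "UT", "region": "Midwest", "total": 7, "week": 10},
    {"state": "CA", "region": "West", "total": 21, "week": 12},
    {"state": "UT", "region": "Midwest", "total": 30, "week": 12},
])

# Week 10 rows relabelled as week 11, so they match the rows that need fixing.
lookup = df.loc[df["week"] == 10, ["state", "region"]].assign(week=11)

# Left merge keeps every original row; only week 11 rows find a match.
fixed = df.merge(lookup, how="left", on=["state", "week"], suffixes=("", "_x"))

# Take the looked-up region for week 11, keep the existing one elsewhere.
fixed["region"] = np.where(fixed["week"] == 11, fixed["region_x"], fixed["region"])
fixed = fixed.drop(columns="region_x")

print(fixed)
```

This assumes every state has exactly one week 10 row. A state with no week 10 row would end up with NaN in week 11, and a state with several week 10 rows would be duplicated by the merge.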

If you don't mind sorting your dataframe, you could also use ffill(). Make sure your 'region' column doesn't already contain NaN values, though. In case it does, I included the line df['region'] = df['region'].replace(np.nan, 'N/A') so that pre-existing NaN values aren't forward-filled by accident; you may not need that line:

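df = df.sort_values(['state', 'week'])  # each state's week 11 row now follows its week 10 row
df['region'] = df['region'].replace(np.nan, 'N/A')  # optional: protect pre-existing NaN values
df['region'] = np.where(df['week'] == 11, np.nan, df['region'])  # blank out the wrong week 11 values
df['region'] = df['region'].ffill()  # copy the week 10 region forward into week 11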
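
A runnable sketch of the forward-fill approach on the question's sample data follows. It differs from the snippet above in two ways that are my assumptions rather than part of the original answer: the week 11 values are blanked with .loc instead of np.where, and the fill is done per state with groupby("state"), so a value can never be carried over from one state into the next if a state happens to lack a week 10 row.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, in shuffled order.
df = pd.DataFrame([
    {"state": "UT", "region": "UT", "total": 18, "week": 11},
    {"state": "CA", "region": "West", "total": 2, "week": 10},
    {"state": "CA", "region": "CA", "total": 14, "week": 11},
    {"state": "UT", "region": "Midwest", "total": 7, "week": 10},
    {"state": "CA", "region": "West", "total": 21, "week": 12},
    {"state": "UT", "region": "Midwest", "total": 30, "week": 12},
])

# Sort so that, within each state, week 11 comes directly after week 10.
df = df.sort_values(["state", "week"])

# Optional: give pre-existing NaN values a placeholder so they are not filled.
df["region"] = df["region"].fillna("N/A")

# Blank out the wrong week 11 values, then fill them from the previous row
# of the same state.
df.loc[df["week"] == 11, "region"] = np.nan
df["region"] = df.groupby("state")["region"].ffill()

print(df)
```

With the per-state fill, a state that has no week 10 row is left as NaN in week 11 rather than silently inheriting another state's region, which makes missing lookups easier to spot.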
