Similar strings causing an IndexError

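Question

I have a pandas df that contains various functions and timestamps. I'm trying to efficiently return the difference between different functions.

Below is a very small sample df. Col C represents the functions, B shows the timestamps, D shows the different places, and E shows the number of occurrences. Essentially, I want to return the difference between functions for different places. These functions occur multiple times.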

import pandas as pd

df = pd.DataFrame({
    'B' : [10,20,35,50],
    'C' : ['Stop','Close','Open','Finish'],
    'D' : ['Home','Home Kitchen','Home','Home'],
    'E' : [1,1,1,1],
    })

I'm currently doing this with the following:

def f(g):
    Stop = g.loc[df['C'] == 'Stop', 'B']
    Finish = g.loc[df['C'] == 'Finish', 'B']
    Open = g.loc[df['C'] == 'Open', 'B']
    g['YX_diff'] = Finish.values[0] - Stop.values[0]
    g['YZ_diff'] = Finish.values[0] - Open.values[0]

    return (g)

I have a list of places that I run this over. The df above mostly shows Home, but there can be many places. To apply this, I include the following:

included = ['Home']

df = df[df.D.isin(included)].groupby(['D', 'E']).apply(f)

The problem I'm having is with the places I want to look at. Specifically, when the strings are similar. For example:

included = ['Home']

works fine. But if I include

included = ['Home','Home Kitchen']

it returns an error:

    g['YX_diff'] = Finish.values[0] - Stop.values[0]

IndexError: index 0 is out of bounds for axis 0 with size 0

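I don't want to change the strings, because they represent specific information. I'm not sure what else I can do?

Tags: python, pandas, loops, dataframe

Answer


The problem is with the Home Kitchen group, not with its name being similar to Home. Its only row is Close, so all 3 filtered Series (Stop, Finish, Open) are empty and there is no first value to select.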

s = pd.Series(dtype='float64')
print (s)
Series([], dtype: float64)

print (s.values[0])

IndexError: index 0 is out of bounds for axis 0 with size 0

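You can check it by printing the filtered Series for each group (in the output below, the first group is printed twice because older pandas versions call the function twice on the first group):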

def f(g):
    # Build each mask from the group g itself rather than the outer df
    Stop = g.loc[g['C'] == 'Stop', 'B']
    Finish = g.loc[g['C'] == 'Finish', 'B']
    Open = g.loc[g['C'] == 'Open', 'B']
    # Print the filtered Series to see which groups come back empty
    print(Stop)
    print(Finish)
    print(Open)
#    g['YX_diff'] = Finish.values[0] - Stop.values[0]
#    g['YZ_diff'] = Finish.values[0] - Open.values[0]

    return g

included = ['Home', 'Home Kitchen']

df = df[df.D.isin(included)].groupby(['D', 'E']).apply(f)

0    10
Name: B, dtype: int64
3    50
Name: B, dtype: int64
2    35
Name: B, dtype: int64
0    10
Name: B, dtype: int64
3    50
Name: B, dtype: int64
2    35
Name: B, dtype: int64
Series([], Name: B, dtype: int64)
Series([], Name: B, dtype: int64)
Series([], Name: B, dtype: int64)
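
As a more compact check than printing each Series, the sketch below tabulates how often each function occurs per place. It uses `pd.crosstab`; the names `counts`, `needed`, and `missing` are illustrative and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'B': [10, 20, 35, 50],
    'C': ['Stop', 'Close', 'Open', 'Finish'],
    'D': ['Home', 'Home Kitchen', 'Home', 'Home'],
    'E': [1, 1, 1, 1],
})

# Count how often each function (column C) occurs for each place (column D).
counts = pd.crosstab(df['D'], df['C'])
print(counts)

# Functions that the difference calculation relies on (assumed list).
needed = ['Stop', 'Finish', 'Open']

# reindex guards against a needed function that never occurs in the data at all.
needed_counts = counts.reindex(columns=needed, fill_value=0)

# Places where at least one needed function is absent.
missing = needed_counts.index[(needed_counts == 0).any(axis=1)].tolist()
print(missing)
```

Any place listed in `missing` would produce an empty Series inside `f` and therefore the IndexError.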

A possible solution is an if-else check for the empty Series, for example falling back to NaN:

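import numpy as np

def f(g):
    # Build each mask from the group g itself rather than the outer df
    Stop = g.loc[g['C'] == 'Stop', 'B']
    Finish = g.loc[g['C'] == 'Finish', 'B']
    Open = g.loc[g['C'] == 'Open', 'B']
    # Fall back to NaN when a function is missing from the group
    Stop = np.nan if len(Stop) == 0 else Stop.values[0]
    Finish = np.nan if len(Finish) == 0 else Finish.values[0]
    Open = np.nan if len(Open) == 0 else Open.values[0]

    # Any difference involving NaN is NaN
    g['YX_diff'] = Finish - Stop
    g['YZ_diff'] = Finish - Open

    return g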

included = ['Home', 'Home Kitchen']

df = df[df.D.isin(included)].groupby(['D', 'E']).apply(f)
print (df)
    B       C             D  E  YX_diff  YZ_diff
0  10    Stop          Home  1     40.0     15.0
1  20   Close  Home Kitchen  1      NaN      NaN
2  35    Open          Home  1     40.0     15.0
3  50  Finish          Home  1     40.0     15.0
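
The same idea can be written as a self-contained sketch that runs on its own. It loops over the groups and concatenates them instead of calling `groupby.apply`, because the index layout returned by `apply` has varied between pandas releases. The helper names `first_or_nan` and `add_diffs` are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'B': [10, 20, 35, 50],
    'C': ['Stop', 'Close', 'Open', 'Finish'],
    'D': ['Home', 'Home Kitchen', 'Home', 'Home'],
    'E': [1, 1, 1, 1],
})

def first_or_nan(series):
    """Return the first value of `series`, or NaN if it is empty."""
    return series.iloc[0] if len(series) else np.nan

def add_diffs(g):
    """Add the two difference columns to a copy of one group."""
    g = g.copy()
    stop = first_or_nan(g.loc[g['C'] == 'Stop', 'B'])
    finish = first_or_nan(g.loc[g['C'] == 'Finish', 'B'])
    open_ = first_or_nan(g.loc[g['C'] == 'Open', 'B'])
    g['YX_diff'] = finish - stop
    g['YZ_diff'] = finish - open_
    return g

included = ['Home', 'Home Kitchen']
subset = df[df['D'].isin(included)]

# Process each (D, E) group separately, then restore the original row order.
parts = [add_diffs(g) for _, g in subset.groupby(['D', 'E'])]
result = pd.concat(parts).sort_index()
print(result)
```

Rows for Home Kitchen keep NaN in both difference columns, while the Home rows carry the computed differences.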

Another solution, in pure Python, is next with iter. Its optional second argument, here NaN, is returned if there is no element to extract:

import numpy as np

def f(g):
    # Build each mask from the group g itself rather than the outer df
    Stop = g.loc[g['C'] == 'Stop', 'B']
    Finish = g.loc[g['C'] == 'Finish', 'B']
    Open = g.loc[g['C'] == 'Open', 'B']

    # next() returns the first element, or the default NaN for an empty Series
    Stop_first = next(iter(Stop), np.nan)
    Finish_first = next(iter(Finish), np.nan)
    Open_first = next(iter(Open), np.nan)

    g['YX_diff'] = Finish_first - Stop_first
    g['YZ_diff'] = Finish_first - Open_first

    return g

included = ['Home', 'Home Kitchen']

df = df[df.D.isin(included)].groupby(['D', 'E']).apply(f)
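
If a per-group function is not required, a vectorized alternative may also work. This is a different technique from the two solutions above, sketched under the assumption that only the first timestamp of each function per (D, E) group matters. The names `first`, `diffs`, and `out` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'B': [10, 20, 35, 50],
    'C': ['Stop', 'Close', 'Open', 'Finish'],
    'D': ['Home', 'Home Kitchen', 'Home', 'Home'],
    'E': [1, 1, 1, 1],
})

included = ['Home', 'Home Kitchen']
subset = df[df['D'].isin(included)]

# First timestamp of each function per (D, E) group, one column per function.
# Combinations that never occur become NaN automatically.
first = (subset.groupby(['D', 'E', 'C'])['B']
               .first()
               .unstack('C')
               .reindex(columns=['Stop', 'Finish', 'Open']))

# Differences per group; NaN wherever a function is missing.
diffs = pd.DataFrame({
    'YX_diff': first['Finish'] - first['Stop'],
    'YZ_diff': first['Finish'] - first['Open'],
}).reset_index()

# Attach the per-group differences back onto every row.
out = subset.merge(diffs, on=['D', 'E'], how='left')
print(out)
```

Note that `merge` returns a fresh integer index, so the original row labels would need to be restored separately if they matter.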
