首页 > 解决方案 > How to sort each row of pandas dataframe and return column index based on sorted values of row

问题描述

I am trying to sort each row of pandas dataframe and get the index of sorted values in a new dataframe. I could do it in a slow way. Can anyone suggest improvements using parallelization or vectorized code for this. I have posted an example below.

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'

# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)

# drop categorical column
gapminder.drop(['country', 'continent'], axis=1, inplace=True) 

# print the first three rows
print(gapminder.head(n=3))

   year         pop  lifeExp   gdpPercap
0  1952   8425333.0   28.801  779.445314
1  1957   9240934.0   30.332  820.853030
2  1962  10267083.0   31.997  853.100710

The result I am looking for is this

tag_0   tag_1   tag_2   tag_3
0   pop year    gdpPercap   lifeExp
1   pop year    gdpPercap   lifeExp
2   pop year    gdpPercap   lifeExp

In this case, since pop is always higher than gdpPercap and lifeExp, it always comes first.

I could achieve the required output by using the following code. But the computation takes longer time if the df has lot of rows/columns.

Can anyone suggest an improvement over this

def sort_df(df):
    sorted_tags = pd.DataFrame(index = df.index, columns = ['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        sorted_tags.iloc[i,:] = list( df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags

sort_df(gapminder)

标签: pythonpandassorting

解决方案


这可能与使用 numpy 一样快:

def sort_df(df):
    return pd.DataFrame(
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        columns=['tag_{}'.format(i) for i in range(df.shape[1])]
    )

print(sort_df(gapminder.head(3)))

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

说明:np.argsort按行对值进行排序,但返回排序数组的索引而不是排序值,可用于对数组进行协同排序。减号按降序排列。在您的情况下,您使用索引对列进行排序。numpy 广播负责返回正确的形状。

您的示例的运行时间约为 3 毫秒,而您的函数为 2.5 秒。


推荐阅读