python - How to sort each row of pandas dataframe and return column index based on sorted values of row
Problem description
I am trying to sort each row of a pandas DataFrame and get the column labels, ordered by the sorted values, in a new DataFrame. I can do it, but only slowly. Can anyone suggest improvements using parallelization or vectorized code? I have posted an example below.
import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
# drop the categorical columns so only numeric columns remain
gapminder.drop(['country', 'continent'], axis=1, inplace=True)
# print the first three rows
print(gapminder.head(n=3))
   year         pop  lifeExp   gdpPercap
0  1952   8425333.0   28.801  779.445314
1  1957   9240934.0   30.332  820.853030
2  1962  10267083.0   31.997  853.100710
The result I am looking for is this:
  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp
In this case, since pop is always higher than gdpPercap and lifeExp, it always comes first.
I can get the required output with the following code, but it takes a long time when df has many rows or columns. Can anyone suggest an improvement?
def sort_df(df):
    sorted_tags = pd.DataFrame(index=df.index, columns=['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        sorted_tags.iloc[i, :] = list(df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags

sort_df(gapminder)
Solution
This is probably about as fast as it gets using numpy:
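import numpy as np

def sort_df(df):
    return pd.DataFrame(
        # negate the values so that argsort yields descending order per row
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        index=df.index,  # keep the original row labels, as the looped version does
        columns=['tag_{}'.format(i) for i in range(df.shape[1])]
    )

print(sort_df(gapminder.head(3)))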
  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp
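Explanation: np.argsort sorts the values along each row, but returns the indices that would sort the array rather than the sorted values, so it can be used to co-sort another array. The minus sign gives descending order. In your case, you use those indices to reorder the column labels. Indexing the 1-D array of column labels with the 2-D index array returns an array of the right shape, one row of labels per row of data.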
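To make the mechanics concrete, here is a minimal sketch on a tiny array. The values and the labels "a", "b", "c" are made up for illustration and are not part of the gapminder data; np.take_along_axis is used only to show that the same index array also reorders the values themselves.

```python
import numpy as np

# Hypothetical 2 x 3 data with made-up column labels
values = np.array([[3.0, 10.0, 1.0],
                   [7.0, 2.0, 5.0]])
labels = np.array(["a", "b", "c"])

# Positions that sort each row in descending order (negation flips the order)
order = np.argsort(-values, axis=1)

# Indexing the 1-D label array with the 2-D position array gives a 2-D label array
sorted_labels = labels[order]

# The same positions reorder the values row by row
sorted_values = np.take_along_axis(values, order, axis=1)

print(order)          # [[1 0 2], [0 2 1]]
print(sorted_labels)  # [['b' 'a' 'c'], ['a' 'c' 'b']]
print(sorted_values)  # [[10. 3. 1.], [7. 5. 2.]]
```

Note that negating the values only works for numeric data, which is why the categorical columns have to be dropped first.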
On your example this runs in about 3 ms, compared with 2.5 s for your function.
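Before relying on the faster version, it may be worth checking that both approaches agree. The sketch below does this on random data rather than on the gapminder file, so it runs without a network connection. The names sort_df_loop and sort_df_numpy, the column labels and the data shape are assumptions chosen for illustration. Rows with tied values are not covered here, since the two sorts are not guaranteed to order ties the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical random data: 200 rows x 5 columns of floats (ties are practically impossible)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 5)), columns=list("abcde"))
tags = [f"tag_{i}" for i in range(df.shape[1])]

def sort_df_loop(df):
    # Row-by-row version: sort each row and keep the column labels
    rows = [list(df.iloc[i].sort_values(ascending=False).index) for i in range(len(df))]
    return pd.DataFrame(rows, index=df.index, columns=tags)

def sort_df_numpy(df):
    # Vectorized version: argsort all rows at once, then look up the labels
    order = np.argsort(-df.to_numpy(), axis=1)
    return pd.DataFrame(df.columns.to_numpy()[order], index=df.index, columns=tags)

# Compare the label arrays element by element
same = bool((sort_df_loop(df).to_numpy() == sort_df_numpy(df).to_numpy()).all())
print(same)
```

To measure the speed difference on your own data, both functions can be wrapped in timeit.timeit or IPython's %timeit.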