python - Interpolate a DataFrame column and sort based on another column in PySpark or Pandas
Question
Given the DataFrame below, we need to pivot the my_column
values into separate columns, pairing each with its some_id,
and then sort the pairs within each new column by their int_column
values in ascending order. The example:
+---------+-----------+------------+
| some_id | my_column | int_column |
+---------+-----------+------------+
| xx1     | id_1      |          3 |
| xx1     | id_2      |          4 |
| xx1     | id_3      |          5 |
| xx2     | id_1      |          6 |
| xx2     | id_2      |          1 |
| xx2     | id_3      |          3 |
| xx3     | id_1      |          4 |
| xx3     | id_2      |          8 |
| xx3     | id_3      |          9 |
| xx4     | id_1      |          1 |
+---------+-----------+------------+
Expected output:
+----------+----------+----------+
|     id_1 |     id_2 |     id_3 |
+----------+----------+----------+
| [xx4, 1] | [xx2, 1] | [xx2, 3] |
| [xx1, 3] | [xx1, 4] | [xx1, 5] |
| [xx3, 4] | [xx3, 8] | [xx3, 9] |
| [xx2, 6] | null     | null     |
+----------+----------+----------+
As you can see, for id_1
the lowest value in int_column
is 1, right at the end of the DataFrame, and it belongs to xx4
in the some_id
column; the next values are 3, 4, and 6, belonging to xx1, xx3, and xx2 respectively.
Any pointers on how to approach this problem? Either PySpark or Pandas can be used.
Code to reproduce the input dataframe:
import pandas as pd

data = {'some_id': ['xx1', 'xx1', 'xx1', 'xx2', 'xx2', 'xx2', 'xx3', 'xx3', 'xx3', 'xx4'],
        'my_column': ['id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1'],
        'int_column': [3, 4, 5, 6, 1, 3, 4, 8, 9, 1]}
df = pd.DataFrame.from_dict(data)
Solution
We need a helper key, created with cumcount
, and then we use groupby
+ apply
(this part acts like pivot
, or you could use pivot_table
or crosstab
instead):
# helper key: position of each row within its my_column group
df = df.assign(key=df.groupby('my_column').cumcount())
df.groupby(['key', 'my_column']).apply(lambda x: list(zip(x['some_id'], x['int_column']))[0]).unstack()
Out[378]:
my_column id_1 id_2 id_3
key
0 (xx1, 3) (xx1, 4) (xx1, 5)
1 (xx2, 6) (xx2, 1) (xx2, 3)
2 (xx3, 4) (xx3, 8) (xx3, 9)
3 (xx4, 1) None None
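As noted above, pivot_table can play the same role as groupby + apply. A minimal sketch, assuming we first pack the (some_id, int_column) pair into a single object column (the column name Value is my choice) and use aggfunc='first' so each cell keeps its single tuple:

```python
import pandas as pd

data = {'some_id': ['xx1', 'xx1', 'xx1', 'xx2', 'xx2', 'xx2', 'xx3', 'xx3', 'xx3', 'xx4'],
        'my_column': ['id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1'],
        'int_column': [3, 4, 5, 6, 1, 3, 4, 8, 9, 1]}
df = pd.DataFrame.from_dict(data)

# helper key: position of each row within its my_column group
df['key'] = df.groupby('my_column').cumcount()
# pack both value columns into one object column so pivot_table can carry them
df['Value'] = list(zip(df['some_id'], df['int_column']))
# aggfunc='first' keeps the single tuple per (key, my_column) cell
out = df.pivot_table(index='key', columns='my_column', values='Value', aggfunc='first')
```

Like the groupby + apply version, this keeps the original row order (no sorting by int_column yet); the missing id_2/id_3 cells for key 3 come out as NaN.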
If using pivot
+ sort_values
:
# use a callable so cumcount runs on the *sorted* frame;
# a pre-computed Series would align on the original index and give wrong keys
df = df.sort_values('int_column').assign(key=lambda d: d.groupby('my_column').cumcount())
df['Value'] = list(zip(df['some_id'], df['int_column']))
s = df.pivot(index='key', columns='my_column', values='Value')
s
Out[397]:
my_column id_1 id_2 id_3
key
0 (xx4, 1) (xx2, 1) (xx2, 3)
1 (xx1, 3) (xx1, 4) (xx1, 5)
2 (xx3, 4) (xx3, 8) (xx3, 9)
3 (xx2, 6) None None