Interpolate a DataFrame column and sort based on another column in PySpark or Pandas

Problem description

Given the following DataFrame, we need to take the my_column values and turn them into separate columns, then sort each of those columns by the int_column values that belong to each some_id, in ascending order. The example:

+--------------------+-----------+------------------+
|          some_id   | my_column |      int_column  |
+--------------------+-----------+------------------+
|xx1                 |id_1       |           3      |
|xx1                 |id_2       |           4      |
|xx1                 |id_3       |           5      |
|xx2                 |id_1       |           6      |
|xx2                 |id_2       |           1      |
|xx2                 |id_3       |           3      |
|xx3                 |id_1       |           4      |
|xx3                 |id_2       |           8      |
|xx3                 |id_3       |           9      |
|xx4                 |id_1       |           1      |
+--------------------+-----------+------------------+

Expected output:

+--------------------+-----------+------------------+
|          id_1      | id_2      |      id_3        |
+--------------------+-----------+------------------+
| [xx4, 1]           |[xx2, 1]   |[xx2, 3]          |
| [xx1, 3]           |[xx1, 4]   |[xx1, 5]          |
| [xx3, 4]           |[xx3, 8]   |[xx3, 9]          |
| [xx2, 6]           |null       |null              |
+--------------------+-----------+------------------+

As you can see, for id_1 the lowest number in int_column is 1, right at the end of the DataFrame, and it belongs to xx4 from the some_id column; the next values are 3, 4, and 6, belonging to xx1, xx3, and xx2 respectively.

Any pointers on how to approach this problem? Either PySpark or Pandas can be used.

Code to reproduce the input DataFrame:

import pandas as pd

data = {'some_id': ['xx1', 'xx1', 'xx1', 'xx2', 'xx2', 'xx2', 'xx3', 'xx3', 'xx3', 'xx4'],
        'my_column': ['id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1'],
        'int_column': [3, 4, 5, 6, 1, 3, 4, 8, 9, 1]}

df = pd.DataFrame.from_dict(data)

Tags: python, pandas, apache-spark, dataframe, pyspark

Solution


We need a helper key, created using groupby + cumcount; then we use groupby + apply (this part works just like pivot, or you can use pivot_table or crosstab — a pivot_table variant is sketched after the output below):

# Helper key: cumulative count within each my_column group
df = df.assign(key=df.groupby('my_column').cumcount())
df.groupby(['key', 'my_column']).apply(lambda x: list(zip(x['some_id'], x['int_column']))[0]).unstack()
Out[378]: 
my_column      id_1      id_2      id_3
key                                    
0          (xx1, 3)  (xx1, 4)  (xx1, 5)
1          (xx2, 6)  (xx2, 1)  (xx2, 3)
2          (xx3, 4)  (xx3, 8)  (xx3, 9)
3          (xx4, 1)      None      None
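
As mentioned above, pivot_table can replace the groupby + apply step. A minimal sketch, assuming the helper key created above is still in place; aggfunc='first' simply takes the single tuple that lands in each (key, my_column) cell:

# pivot_table variant of the groupby + apply step (sketch)
df['Value'] = list(zip(df['some_id'], df['int_column']))
df.pivot_table(index='key', columns='my_column', values='Value', aggfunc='first')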

If using pivot + sort_values (sorting by int_column first is what produces the expected row order):

# Sort first, then recompute the helper key on the sorted frame
# (computing cumcount before sorting would assign the wrong keys,
# since assign aligns the Series on the original index)
df = df.sort_values('int_column')
df['key'] = df.groupby('my_column').cumcount()
df['Value'] = list(zip(df['some_id'], df['int_column']))
s = df.pivot(index='key', columns='my_column', values='Value')
s
Out[397]: 
my_column      id_1      id_2      id_3
key                                    
0          (xx4, 1)  (xx2, 1)  (xx2, 3)
1          (xx1, 3)  (xx1, 4)  (xx1, 5)
2          (xx3, 4)  (xx3, 8)  (xx3, 9)
3          (xx2, 6)      None      None
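
Since the question allows PySpark as well, here is a rough sketch of the same idea there, assuming a local SparkSession (variable names sdf, w, and result are mine): row_number over a window ordered by int_column plays the role of the sorted cumcount key, and first(struct(...)) packs the (some_id, int_column) pair into each pivoted cell.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
# Rebuild the input from the original three columns of the pandas frame
sdf = spark.createDataFrame(df[['some_id', 'my_column', 'int_column']])

# Rank rows within each my_column group by int_column (ascending),
# mirroring the sort_values + cumcount helper key above
w = Window.partitionBy('my_column').orderBy('int_column')
sdf = sdf.withColumn('key', F.row_number().over(w) - 1)

# Pivot my_column into columns; each cell is a (some_id, int_column) struct
result = (sdf.groupBy('key')
             .pivot('my_column')
             .agg(F.first(F.struct('some_id', 'int_column')))
             .orderBy('key'))
result.show()

One caveat: row_number breaks ties arbitrarily, whereas pandas' sort is stable; with this data there are no ties within a my_column group, so the row order matches the pandas result.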
