首页 > 解决方案 > 根据字典的值将字典列表的值分配给数据帧,具体取决于数据帧的列值

问题描述

我想将取决于字典字段(localSymbol_aux)的字典列表的值分配给数据帧,具体取决于火花结构化流(pyspark)中数据帧(localSymbol)的列值

我有以下字典列表

level_1_batch_list_of_dict = [
{
'localSymbol_aux': 'EUR.USD', 
'level_1_precio_max_sesion_aux': 4,
'level_1_precio_min_sesion_aux': 0
},
{
'localSymbol_aux': 'USD.JPY', 
'level_1_precio_max_sesion_aux': 6,
'level_1_precio_min_sesion_aux': 0
}
]

我有以下数据框:

+------------------------------------------+-----------+-----------------+
|ventana                                   |localSymbol|precio_max_sesion|
+------------------------------------------+-----------+-----------------+
|{2021-05-10 11:33:30, 2021-05-10 11:34:00}|EUR.USD    |0                |
|{2021-05-10 11:34:00, 2021-05-10 11:34:30}|EUR.USD    |0                |
|{2021-05-10 11:34:30, 2021-05-10 11:35:00}|EUR.USD    |0                |
|{2021-05-10 11:35:00, 2021-05-10 11:35:30}|EUR.USD    |0                |
|{2021-05-10 11:35:30, 2021-05-10 11:36:00}|EUR.USD    |0                |
|{2021-05-10 11:36:00, 2021-05-10 11:36:30}|EUR.USD    |0                |
|{2021-05-10 11:33:30, 2021-05-10 11:34:00}|USD.JPY    |0                |
|{2021-05-10 11:34:00, 2021-05-10 11:34:30}|USD.JPY    |0                |
|{2021-05-10 11:34:30, 2021-05-10 11:35:00}|USD.JPY    |0                |
|{2021-05-10 11:35:00, 2021-05-10 11:35:30}|USD.JPY    |0                |
|{2021-05-10 11:35:30, 2021-05-10 11:36:00}|USD.JPY    |0                |
|{2021-05-10 11:36:00, 2021-05-10 11:36:30}|USD.JPY    |0                |
+------------------------------------------+-----------+-----------------+

但我想要以下内容:

+------------------------------------------+-----------+-----------------+
|ventana                                   |localSymbol|precio_max_sesion|
+------------------------------------------+-----------+-----------------+
|{2021-05-10 11:33:30, 2021-05-10 11:34:00}|EUR.USD    |4                |
|{2021-05-10 11:34:00, 2021-05-10 11:34:30}|EUR.USD    |4                |
|{2021-05-10 11:34:30, 2021-05-10 11:35:00}|EUR.USD    |4                |
|{2021-05-10 11:35:00, 2021-05-10 11:35:30}|EUR.USD    |4                |
|{2021-05-10 11:35:30, 2021-05-10 11:36:00}|EUR.USD    |4                |
|{2021-05-10 11:36:00, 2021-05-10 11:36:30}|EUR.USD    |4                |
|{2021-05-10 11:33:30, 2021-05-10 11:34:00}|USD.JPY    |6                |
|{2021-05-10 11:34:00, 2021-05-10 11:34:30}|USD.JPY    |6                |
|{2021-05-10 11:34:30, 2021-05-10 11:35:00}|USD.JPY    |6                |
|{2021-05-10 11:35:00, 2021-05-10 11:35:30}|USD.JPY    |6                |
|{2021-05-10 11:35:30, 2021-05-10 11:36:00}|USD.JPY    |6                |
|{2021-05-10 11:36:00, 2021-05-10 11:36:30}|USD.JPY    |6                |
+------------------------------------------+-----------+-----------------+

有任何想法吗?

谢谢!!

标签: pythonpysparkspark-structured-streaming

解决方案


您可以将上面的字典转换为 PySpark Dataframe 然后做一个简单的连接

# this is the reference dataframe
ref = spark.createDataFrame(level_1_batch_list_of_dict)
ref.show(10, False)
# +-----------------------------+-----------------------------+---------------+
# |level_1_precio_max_sesion_aux|level_1_precio_min_sesion_aux|localSymbol_aux|
# +-----------------------------+-----------------------------+---------------+
# |4                            |0                            |EUR.USD        |
# |6                            |0                            |USD.JPY        |
# +-----------------------------+-----------------------------+---------------+

# this is your main dataframe
df.show(2, False)
# +------------------------------------------+-----------+-----------------+
# |ventana                                   |localSymbol|precio_max_sesion|
# +------------------------------------------+-----------+-----------------+
# |{2021-05-10 11:33:30, 2021-05-10 11:34:00}|EUR.USD    |0                |
# |{2021-05-10 11:34:00, 2021-05-10 11:34:30}|EUR.USD    |0                |
# +------------------------------------------+-----------+-----------------+

# join them together and get correct values
(df
    .join(ref, on=[df.localSymbol == ref.localSymbol_aux])
    .select(df.ventana, df.localSymbol, ref.level_1_precio_max_sesion_aux.alias('precio_max_sesion'))
    .show(100, False)
)
# +------------------------------------------+-----------+-----------------+
# |ventana                                   |localSymbol|precio_max_sesion|
# +------------------------------------------+-----------+-----------------+
# |{2021-05-10 11:33:30, 2021-05-10 11:34:00}|EUR.USD    |4                |
# |{2021-05-10 11:34:00, 2021-05-10 11:34:30}|EUR.USD    |4                |
# |{2021-05-10 11:34:30, 2021-05-10 11:35:00}|EUR.USD    |4                |
# |{2021-05-10 11:35:00, 2021-05-10 11:35:30}|EUR.USD    |4                |
# |{2021-05-10 11:35:30, 2021-05-10 11:36:00}|EUR.USD    |4                |
# |{2021-05-10 11:36:00, 2021-05-10 11:36:30}|EUR.USD    |4                |
# |{2021-05-10 11:33:30, 2021-05-10 11:34:00}|USD.JPY    |6                |
# |{2021-05-10 11:34:00, 2021-05-10 11:34:30}|USD.JPY    |6                |
# |{2021-05-10 11:34:30, 2021-05-10 11:35:00}|USD.JPY    |6                |
# |{2021-05-10 11:35:00, 2021-05-10 11:35:30}|USD.JPY    |6                |
# |{2021-05-10 11:35:30, 2021-05-10 11:36:00}|USD.JPY    |6                |
# |{2021-05-10 11:36:00, 2021-05-10 11:36:30}|USD.JPY    |6                |
# +------------------------------------------+-----------+-----------------+

推荐阅读