首页 > 解决方案 > 将函数从 python 转换为 pyspark

问题描述

我想比较两个 pyspark 数据框并在新表中获取差异。

我测试了使用python的方法:

我的数据框

data = {'name': ['NO_VALUE', 'Molly', 'Tina', 'Jake', 'Amy'],
    'year': [2012, -999999, 2013, 2014, 2014],
    'reports': [4, 24, 31, 2, 3]}
df3 = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df3

我的参考数据框

data_ref = {'name': ['Jaso', 'Molly', 'Tina', 'Jake', 'Amy'],
    'year': [2012, 202, 2013, 2014, 2014],
    'reports': [4, 24, 31, 2, 3]}
df_ref3 = pd.DataFrame(data_ref, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df_ref3

比较行:

def compare_datasets(df, df_ref):
    ne_stacked = (df != df_ref).stack()
    changed = ne_stacked[ne_stacked]
    changed.index.names = ['id', 'col']
    difference_locations = np.where(df != df_ref)
    changed_from = df.values[difference_locations]
    changed_to = df_ref.values[difference_locations]
    error_test = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    return error_test

compare_datasets(df3, df_ref3)

但是,我想在 pyspark 中执行此操作。有人有想法吗?

谢谢!

标签: pythonpysparkpyspark-sql

解决方案


可能很难准确地重现这种行为。我为您提供一种部分解决方案:

df.show()
+----------+--------+-------+-------+
|     index|    name|   year|reports|
+----------+--------+-------+-------+
|   Cochice|NO_VALUE|   2012|      4|
|      Pima|   Molly|-999999|     24|
|Santa Cruz|    Tina|   2013|     31|
|  Maricopa|    Jake|   2014|      2|
|      Yuma|     Amy|   2014|      3|
+----------+--------+-------+-------+

df_ref.show()
+----------+-----+----+-------+
|     index| name|year|reports|
+----------+-----+----+-------+
|   Cochice| Jaso|2012|      4|
|      Pima|Molly|2012|     24|
|Santa Cruz| Tina|2013|     31|
|  Maricopa| Jake|2014|      2|
|      Yuma|  Amy|2014|      3|
+----------+-----+----+-------+

df.subtract(df_ref).show()
+-------+--------+-------+-------+
|  index|    name|   year|reports|
+-------+--------+-------+-------+
|   Pima|   Molly|-999999|     24|
|Cochice|NO_VALUE|   2012|      4|
+-------+--------+-------+-------+

或者你可以做一个慢的:

for col in df_ref.columns:
  out = df.select(col).subtract(df_ref.select(col))
  if out.first():
    print(out.collect())

[Row(name=u'NO_VALUE')]
[Row(year=-999999)]

推荐阅读