python - Pyspark: Is there a way to create a Summary table (or dataframe) by merging multiple rows into one?
Problem description
I have created the following dataframe by parsing multiple CSVs in Spark. I need a summary of the average sales of each month, per city, per SKU, per year.
<table><tbody><tr><th>city</th><th>sku_id</th><th>year</th><th>month</th><th>avg_sales</th></tr><tr><td>A</td><td>SKU1</td><td>2017</td><td>Jan</td><td>100</td></tr><tr><td>A</td><td>SKU1</td><td>2017</td><td>Feb</td><td>120</td></tr><tr><td>..</td><td>..</td><td>..</td><td>..</td><td>..</td></tr><tr><td>Z</td><td>SKU100</td><td>2019</td><td>Dec</td><td>99</td></tr></tbody></table>
Desired output:
<table><tbody><tr><th>city</th><th>sku_id</th><th>year</th><th>Jan_avg_sales</th><th>Feb_avg_sales</th><th>..</th><th>Dec_avg_sales</th></tr><tr><td>A</td><td>SKU1</td><td>2017</td><td>100</td><td>120</td><td>..</td><td>320</td></tr><tr><td>A</td><td>SKU1</td><td>2017</td><td>98</td><td>118</td><td>..</td><td>318</td></tr><tr><td>..</td><td>..</td><td>..</td><td>..</td><td>..</td><td>..</td><td>..</td></tr><tr><td>Z</td><td>SKU100</td><td>2019</td><td>99</td><td>114</td><td>..</td><td>314</td></tr></tbody></table>
I have implemented the summary table creation using a Python dictionary, but I'm not convinced by the solution.
Here is the code snippet I have tried so far:
    path = "s3a://bucket/city1*"
    cleaned_df = spark.read.format('csv').options(header='true', inferSchema='true').load(path)
    cleaned_df = cleaned_df.groupby(['Year','city','sku_id']).mean()
    cleaned_df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("mydata4csv")
Solution
If you have a dataframe that looks like:
   avg_sales city sku_id  year
0        300    A   sku1  2017
1        210    A   sku1  2018
2        200    A   sku2  2017
3         10    A   sku2  2017
4         10    B   sku1  2017
5        190    B   sku1  2017
6        130    B   sku2  2017
7        130    B   sku2  2017
8         50    C   sku2  2017
Then you can do:
dataframe.groupby(['city', 'sku_id', 'year']).mean()
And get:
                  avg_sales
city sku_id year
A    sku1   2017        300
            2018        210
     sku2   2017        105
B    sku1   2017        100
     sku2   2017        130
C    sku2   2017         50
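The printed result has the shape of a pandas grouped output, so here is a minimal sketch of the same grouping, assuming the data is a pandas DataFrame. The rows are copied from the example above; the variable name `summary` is only for illustration.

```python
import pandas as pd

# Sample rows copied from the example dataframe in this answer.
dataframe = pd.DataFrame({
    "avg_sales": [300, 210, 200, 10, 10, 190, 130, 130, 50],
    "city": ["A", "A", "A", "A", "B", "B", "B", "B", "C"],
    "sku_id": ["sku1", "sku1", "sku2", "sku2", "sku1", "sku1", "sku2", "sku2", "sku2"],
    "year": [2017, 2018, 2017, 2017, 2017, 2017, 2017, 2017, 2017],
})

# Group on the three key columns and average avg_sales within each group.
# The result is a Series indexed by (city, sku_id, year).
summary = dataframe.groupby(["city", "sku_id", "year"])["avg_sales"].mean()
print(summary)
```

Rows that share the same city, SKU and year collapse into a single row holding their mean.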
If you share your Python code, I can touch up the answer to fit your case.