How to limit and partition data in a PySpark DataFrame

Problem description

I have the following data:

+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|restaurant_id|     restaurant_name|     city|state|postal_code|               stars|review_count|cuisine_name|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+
|        62112|      Neptune Oyster|   Boston|   MA|      02113|4.500000000000000000|        5115|    American|
|        62112|      Neptune Oyster|   Boston|   MA|      02113|4.500000000000000000|        5115|        Thai|
|        60154|Giacomo's Ristora...|   Boston|   MA|      02113|4.000000000000000000|        3520|     Italian|
|        61455|Atlantic Fish Com...|   Boston|   MA|      02116|4.000000000000000000|        2575|    American|
|        57757|      Top of the Hub|   Boston|   MA|      02199|3.500000000000000000|        2273|    American|
|        58631|         Carmelina's|   Boston|   MA|      02113|4.500000000000000000|        2250|     Italian|
|        58895|         The Beehive|   Boston|   MA|      02116|3.500000000000000000|        2184|    American|
|        56517|Lolita Cocina & T...|   Boston|   MA|      02116|4.000000000000000000|        2179|    American|
|        56517|Lolita Cocina & T...|   Boston|   MA|      02116|4.000000000000000000|        2179|     Mexican|
|        58440|                Toro|   Boston|   MA|      02118|4.000000000000000000|        2175|     Spanish|
|        58615|     Regina Pizzeria|   Boston|   MA|      02113|4.000000000000000000|        2071|     Italian|
|        58723|            Gaslight|   Boston|   MA|      02118|4.000000000000000000|        2056|    American|
|        58723|            Gaslight|   Boston|   MA|      02118|4.000000000000000000|        2056|      French|
|        60920|  Modern Pastry Shop|   Boston|   MA|      02113|4.000000000000000000|        2042|     Italian|
|        59453|Gourmet Dumpling ...|   Boston|   MA|      02111|3.500000000000000000|        1990|   Taiwanese|
|        59453|Gourmet Dumpling ...|   Boston|   MA|      02111|3.500000000000000000|        1990|     Chinese|
|        59204|Russell House Tavern|Cambridge|   MA|      02138|4.000000000000000000|        1965|    American|
|        60732|Eastern Standard ...|   Boston|   MA|      02215|4.000000000000000000|        1890|    American|
|        60732|Eastern Standard ...|   Boston|   MA|      02215|4.000000000000000000|        1890|      French|
|        56970|         Border Café|Cambridge|   MA|      02138|4.000000000000000000|        1880|     Mexican|
+-------------+--------------------+---------+-----+-----------+--------------------+------------+------------+

I want to partition the data by city, state and cuisine, order it by stars and review count, and finally limit the number of records per partition.

Can this be done in PySpark?

Tags: pyspark, apache-spark-sql

Solution


You can add a row_number to each window partition and filter on it to limit the records per window. The max_number_of_rows_per_partition variable in the code below controls the maximum number of rows kept per window.

Since your question doesn't specify how you want stars and review_count ordered, I've assumed both are descending.

import pyspark.sql.functions as F
from pyspark.sql import Window

# Partition by city, state and cuisine; order each partition by
# stars and review_count, both descending
window_spec = Window.partitionBy("city", "state", "cuisine_name")\
                    .orderBy(F.col("stars").desc(), F.col("review_count").desc())

max_number_of_rows_per_partition = 3

# Rank rows within each window, keep at most N per window,
# then drop the helper column
df.withColumn("row_number", F.row_number().over(window_spec))\
  .filter(F.col("row_number") <= max_number_of_rows_per_partition)\
  .drop("row_number")\
  .show(200, False)
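If you want to sanity-check what the window logic does before running it on Spark, the same top-N-per-group semantics can be sketched in plain Python (no Spark required). The rows and names below are taken from the sample data above; the `top_n_per_group` helper is hypothetical, not part of any Spark API:

```python
from itertools import groupby
from operator import itemgetter

# (restaurant_name, city, state, cuisine_name, stars, review_count)
rows = [
    ("Neptune Oyster",       "Boston", "MA", "American", 4.5, 5115),
    ("Atlantic Fish Com...", "Boston", "MA", "American", 4.0, 2575),
    ("Top of the Hub",       "Boston", "MA", "American", 3.5, 2273),
    ("The Beehive",          "Boston", "MA", "American", 3.5, 2184),
    ("Lolita Cocina & T...", "Boston", "MA", "American", 4.0, 2179),
    ("Gaslight",             "Boston", "MA", "American", 4.0, 2056),
]

def top_n_per_group(rows, n):
    # partition by (city, state, cuisine_name)
    key = itemgetter(1, 2, 3)
    out = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        # order by stars desc, review_count desc within the partition,
        # then keep at most n rows (the row_number <= n filter)
        ranked = sorted(group, key=lambda r: (-r[4], -r[5]))
        out.extend(ranked[:n])
    return out

top3 = top_n_per_group(rows, 3)
for r in top3:
    print(r[0], r[4], r[5])
```

For the Boston/MA/American partition this keeps Neptune Oyster (4.5 stars), then the two 4.0-star rows with the highest review counts, mirroring what the window filter produces.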
