If statement inside a UDF

Problem description

I have a table like this:

+----------------------+
|Country_state         |
+----------------------+
| Virginia             |
| New Jersey           |
| British Columbia     |
|Over the North Sea    |
| Germany              |
| Belgium              |
| Germany              |
| Bulgeria             |
| England              |
| England              |
| Germany              |
| England              |
| Belgium              |
...

I need to get the country, so I wrote a simple UDF:

def USA(co):
    states = ["Alaska", "Alabama", "Arkansas", "American Samoa", "Arizona", "California", "Colorado", "Connecticut", "District of Columbia", "Delaware", "Florida", "Georgia", "Guam", "Hawaii", "Iowa", "Idaho", "Illinois", "Indiana", "Kansas", "Kentucky", "Louisiana", "Massachusetts", "Maryland", "Maine", "Michigan", "Minnesota", "Missouri", "Mississippi", "Montana", "North Carolina", "North Dakota", "Nebraska", "New Hampshire", "New Jersey", "New Mexico", "Nevada", "New York", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Virginia", "Virgin Islands", "Vermont", "Washington", "Wisconsin", "West Virginia", "Wyoming"]

    if co in states:
        return "USA"
    else:
        return co

But the check always seems to be false, and I don't know why.

This is how I call it:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
usa = udf(USA, StringType())
finalCountry = c.withColumn("CountryFINAL", usa(c.Country_state))

Tags: python, pyspark

Solution

No UDF is needed; use .isin with when/otherwise:

from pyspark.sql import functions as F

states = ["Alaska", "Alabama", "Arkansas", "American Samoa", "Arizona", "California", "Colorado", "Connecticut", "District of Columbia", "Delaware", "Florida", "Georgia", "Guam", "Hawaii", "Iowa", "Idaho", "Illinois", "Indiana", "Kansas", "Kentucky", "Louisiana", "Massachusetts", "Maryland", "Maine", "Michigan", "Minnesota", "Missouri", "Mississippi", "Montana", "North Carolina", "North Dakota", "Nebraska", "New Hampshire", "New Jersey", "New Mexico", "Nevada", "New York", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Virginia", "Virgin Islands", "Vermont", "Washington", "Wisconsin", "West Virginia", "Wyoming"]

df.withColumn("countryFINAL", F.when(F.col("Country_state").isin(states), F.lit("USA"))\
                               .otherwise(F.col("Country_state"))).show()

#+------------------+------------------+
#|     Country_state|      countryFINAL|
#+------------------+------------------+
#|          Virginia|               USA|
#|        New Jersey|               USA|
#|  British Columbia|  British Columbia|
#|Over the North Sea|Over the North Sea|
#|           Germany|           Germany|
#|           Belgium|           Belgium|
#|           Germany|           Germany|
#|          Bulgeria|          Bulgeria|
#|           England|           England|
#|           England|           England|
#|           Germany|           Germany|
#|           England|           England|
#|           Belgium|           Belgium|
#+------------------+------------------+
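If the membership check still never matches, one possible cause (an assumption, suggested only by the padding in the question's sample table) is stray whitespace around the values, since both Python's `in` and `.isin` compare strings exactly. The sketch below shows the effect in plain Python; `to_country` and the shortened `US_STATES` set are hypothetical names used for illustration, not part of the original code.

```python
# Hypothetical, shortened subset of the full states list, for illustration only.
US_STATES = frozenset(["Virginia", "New Jersey", "New York"])


def to_country(value, states=US_STATES):
    """Return "USA" for a US state name, otherwise the whitespace-trimmed input."""
    if value is None:  # a SQL NULL reaches a Python UDF as None
        return None
    cleaned = value.strip()  # assumption: values may carry leading/trailing spaces
    return "USA" if cleaned in states else cleaned


# Membership is an exact string comparison, so a padded value does not match.
print(" Virginia" in US_STATES)   # False
print(to_country(" Virginia"))    # USA
print(to_country(" Germany"))     # Germany
```

A function like this could be wrapped with udf(to_country, StringType()) in the same way as in the question. Staying with built-in functions, applying F.trim to the column before .isin should have a similar effect without a UDF.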
