StringIndexer NumberFormatException: value not visible in the column

Problem description

These are all the distinct values in state_msg, the string column I want to encode.

df.groupBy('state_msg').count().show()
+----------+--------+                                                           
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  164790|
|  Canceled| 1063663|
|  Finished|36100201|
|Terminated|   12982|
|    Failed|  941183|
| Timed out| 5726363|
|     Error| 1957993|
|  Off-line|  186322|
| Not found|  592259|
+----------+--------+

I am trying to one-hot encode this column:

import pyspark.sql.functions as func

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='state_msg', outputCol='state_msg_index')
indexed_df = indexer.fit(df).transform(df)

But I get the exception below, which makes no sense to me: according to the distinct values produced by the groupBy above, "1234567890" is not a possible value of state_msg.

    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "1234567890"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:583)
    at java.lang.Integer.parseInt(Integer.java:615)

df.groupBy('state_msg').count().show(n=100)
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  165241|
|  Canceled| 1067515|
|  Finished|36270559|
|Terminated|   12997|
|    Failed|  944131|
| Timed out| 5745550|
|     Error| 1959041|
|  Off-line|  186899|
| Not found|  593823|
+----------+--------+

df.agg(func.countDistinct('state_msg').alias('count')).show()

+-----+
|count|
+-----+
|   10|
+-----+
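
For context on what the fit step above is expected to compute, here is a minimal plain-Python sketch, not Spark code. It assumes StringIndexer's default ordering (stringOrderType="frequencyDesc", where the most frequent label gets index 0) and reuses the counts from the first groupBy output. The helper names frequency_desc_index and one_hot are hypothetical and exist only for this illustration.

```python
# Counts copied from the first groupBy output in the question.
counts = {
    "Redirected": 28,
    "Busy": 164790,
    "Canceled": 1063663,
    "Finished": 36100201,
    "Terminated": 12982,
    "Failed": 941183,
    "Timed out": 5726363,
    "Error": 1957993,
    "Off-line": 186322,
    "Not found": 592259,
}


def frequency_desc_index(label_counts):
    """Hypothetical helper: map each label to an index, most frequent first.

    Mirrors the assumed default ordering of StringIndexer (frequencyDesc).
    """
    ordered = sorted(label_counts, key=lambda label: -label_counts[label])
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Hypothetical helper: dense list with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[int(index)] = 1.0
    return vec


label_index = frequency_desc_index(counts)

# Print the mapping in index order.
for label, idx in sorted(label_index.items(), key=lambda kv: kv[1]):
    print(f"{label:>10} -> {idx}")
```

Every label here is handled as an opaque string and none is ever parsed as a number, which is consistent with the surprise at seeing Integer.parseInt in the stack trace. Note that Spark's OneHotEncoder returns a sparse vector and by default drops the last category, so the dense one_hot list is only an approximation of its output.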

Tags: java, python, apache-spark, pyspark

Solution

