VectorIndexer or OneHotEncoder for categorical variables?

Problem description

I'm somewhat confused about whether to use VectorIndexer or OneHotEncoder when feeding categorical variables into an ML algorithm in Spark. Is it the case that I need OneHotEncoder when I want to know the effect of each individual category level in the ML output, and can use VectorIndexer otherwise?

Here is an example:

from pyspark.ml.feature import OneHotEncoder, VectorAssembler, VectorIndexer

df = sqlContext.createDataFrame([
    (0.0, 3.0, 3.8),
    (1.0, 0.0, 6.7),
    (2.0, 3.0, 3.3),
    (0.0, 2.0, 1.2),
    (0.0, 1.0, 7.8),
    (2.0, 0.0, 4.4)
], ["category1", "category2","readings"])

encoder = OneHotEncoder(dropLast=True, inputCols=["category1", "category2"],
                        outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()


+---------+---------+--------+-------------+-------------+
|category1|category2|readings| categoryVec1| categoryVec2|
+---------+---------+--------+-------------+-------------+
|      0.0|      3.0|     3.8|(2,[0],[1.0])|    (3,[],[])|
|      1.0|      0.0|     6.7|(2,[1],[1.0])|(3,[0],[1.0])|
|      2.0|      3.0|     3.3|    (2,[],[])|    (3,[],[])|
|      0.0|      2.0|     1.2|(2,[0],[1.0])|(3,[2],[1.0])|
|      0.0|      1.0|     7.8|(2,[0],[1.0])|(3,[1],[1.0])|
|      2.0|      0.0|     4.4|    (2,[],[])|(3,[0],[1.0])|
+---------+---------+--------+-------------+-------------+


va = VectorAssembler(inputCols=df.columns, outputCol='features')
assembled = va.transform(df)
idx = VectorIndexer(inputCol='features', outputCol='features_indexed', maxCategories=4)
idx_model = idx.fit(assembled)
transformed = idx_model.transform(assembled)
transformed.show()

+---------+---------+--------+-------------+----------------+
|category1|category2|readings|     features|features_indexed|
+---------+---------+--------+-------------+----------------+
|      0.0|      3.0|     3.8|[0.0,3.0,3.8]|   [0.0,3.0,3.8]|
|      1.0|      0.0|     6.7|[1.0,0.0,6.7]|   [1.0,0.0,6.7]|
|      2.0|      3.0|     3.3|[2.0,3.0,3.3]|   [2.0,3.0,3.3]|
|      0.0|      2.0|     1.2|[0.0,2.0,1.2]|   [0.0,2.0,1.2]|
|      0.0|      1.0|     7.8|[0.0,1.0,7.8]|   [0.0,1.0,7.8]|
|      2.0|      0.0|     4.4|[2.0,0.0,4.4]|   [2.0,0.0,4.4]|
+---------+---------+--------+-------------+----------------+

idx_model.categoryMaps

{0: {0.0: 0, 1.0: 1, 2.0: 2}, 1: {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3}}

Tags: pyspark

Solution


To my understanding, OneHotEncoder applies only to numeric columns of category indices. If your categorical variable is a StringType column, you need to pass it through StringIndexer first before you can apply OneHotEncoder.
StringIndexer transforms the labels into numeric indices, and OneHotEncoder then creates the coded column for each value.
The way Spark outputs the results of OneHotEncoder is unintuitive; the docs say in the Notes section:

This is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.
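The all-zeros vectors like (2,[],[]) in your first output are that dropped last category. Here is a minimal plain-Python sketch of the dropLast=True encoding (an illustration of the idea, not Spark's actual code):

```python
def one_hot_drop_last(index, num_categories):
    """Mimic OneHotEncoder with dropLast=True: the output has
    num_categories - 1 slots, and the last category is encoded
    as the all-zeros vector (printed by Spark as e.g. (2,[],[]))."""
    size = num_categories - 1
    vec = [0.0] * size
    if index < size:  # the last category (index == size) stays all zeros
        vec[index] = 1.0
    return vec

# category1 has 3 distinct values, so its vectors have size 2:
print(one_hot_drop_last(0, 3))  # [1.0, 0.0]  ->  (2,[0],[1.0])
print(one_hot_drop_last(2, 3))  # [0.0, 0.0]  ->  (2,[],[])
```

Dropping the last slot avoids the encoded columns summing to one and thus being linearly dependent, which is exactly the rationale the Spark docs give for the default.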

If your categorical values are already assembled into a Vector column, then you would use VectorIndexer instead. Specifically, you can use VectorIndexer on your "features" column, as in your second snippet. Here's a similar question.
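For intuition, here is a plain-Python sketch (not Spark's implementation) of the decision VectorIndexer.fit makes: a vector slot is treated as categorical only if it has at most maxCategories distinct values, which reproduces the categoryMaps from your example:

```python
def categorical_slots(rows, max_categories=4):
    """Mimic VectorIndexer.fit: a vector slot is categorical iff it
    has at most max_categories distinct values; each such slot gets
    a value -> index map, continuous slots are left out."""
    n_cols = len(rows[0])
    maps = {}
    for j in range(n_cols):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            maps[j] = {v: i for i, v in enumerate(distinct)}
    return maps

rows = [(0.0, 3.0, 3.8), (1.0, 0.0, 6.7), (2.0, 3.0, 3.3),
        (0.0, 2.0, 1.2), (0.0, 1.0, 7.8), (2.0, 0.0, 4.4)]
print(categorical_slots(rows))
# {0: {0.0: 0, 1.0: 1, 2.0: 2}, 1: {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3}}
```

The "readings" slot has 6 distinct values, more than maxCategories=4, so it is treated as continuous and left untouched. In your example, features_indexed happens to equal features because the category values (0.0, 1.0, 2.0, 3.0) already coincide with the indices they map to.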

You need to fill the nulls in your categorical columns first.
In PySpark, that's df.na.fill("value", subset=["col1", "col2", ...]).
In Scala, that's df.na.fill("value", Seq("col1", "col2", ...)).
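To see what that fill does, here is a plain-Python mimic (illustration only; the real na.fill operates on DataFrame columns):

```python
def na_fill(rows, value, subset):
    """Plain-Python sketch of DataFrame.na.fill: replace None with
    `value`, but only in the columns listed in `subset`."""
    return [{k: (value if v is None and k in subset else v)
             for k, v in row.items()}
            for row in rows]

rows = [{"name": "Sam", "qualification": None},
        {"name": None, "qualification": "B.A."}]
print(na_fill(rows, "NA", subset=["qualification"]))
# "name" stays None in the second row because it is not in the subset
```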

Here's a full application example:

dummydata = [
  (1,"John","B.A.",20,"Male"),
  (2,"Martha","B.Com.",None,"Female"),
  (3,"Mona","B.Com.",21,"Female"),
  (4,"Harish","B.Sc.",22,"Male"),
  (5,"Sam",None,35,"Male"),
  (6,"Jonny","B.A.",22,"Male"),
  (7,"Maria","B.A.",None,"Female"),
  (8,None,"B.A.",25,"Male"),
  (9,"Monalisa","B.A.",21,"Female")
]

toydf = spark.createDataFrame(data=dummydata, schema=["id", "name", "qualification", "age", "gender"])

toydf.show()
+---+--------+-------------+----+------+
| id|    name|qualification| age|gender|
+---+--------+-------------+----+------+
|  1|    John|         B.A.|  20|  Male|
|  2|  Martha|       B.Com.|null|Female|
|  3|    Mona|       B.Com.|  21|Female|
|  4|  Harish|        B.Sc.|  22|  Male|
|  5|     Sam|         null|  35|  Male|
|  6|   Jonny|         B.A.|  22|  Male|
|  7|   Maria|         B.A.|null|Female|
|  8|    null|         B.A.|  25|  Male|
|  9|Monalisa|         B.A.|  21|Female|
+---+--------+-------------+----+------+

toydf = toydf.na.fill("NA", subset=["name", "qualification"])

toydf.show()
+---+--------+-------------+----+------+
| id|    name|qualification| age|gender|
+---+--------+-------------+----+------+
|  1|    John|         B.A.|  20|  Male|
|  2|  Martha|       B.Com.|null|Female|
|  3|    Mona|       B.Com.|  21|Female|
|  4|  Harish|        B.Sc.|  22|  Male|
|  5|     Sam|           NA|  35|  Male|
|  6|   Jonny|         B.A.|  22|  Male|
|  7|   Maria|         B.A.|null|Female|
|  8|      NA|         B.A.|  25|  Male|
|  9|Monalisa|         B.A.|  21|Female|
+---+--------+-------------+----+------+

from pyspark.ml.feature import OneHotEncoder, StringIndexer

indexer_1 = StringIndexer(inputCols=["qualification"], outputCols=["qual_index"],
                          handleInvalid='keep', stringOrderType='frequencyDesc')

ohe_1 = OneHotEncoder(inputCols=["qual_index"], outputCols=["qual_coded"],
                      handleInvalid='keep', dropLast=True)

toydf = indexer_1.fit(toydf).transform(toydf)
toydf = ohe_1.fit(toydf).transform(toydf)

toydf.show()
+---+--------+-------------+----+------+----------+-------------+
| id|    name|qualification| age|gender|qual_index|   qual_coded|
+---+--------+-------------+----+------+----------+-------------+
|  1|    John|         B.A.|  20|  Male|       0.0|(5,[0],[1.0])|
|  2|  Martha|       B.Com.|null|Female|       1.0|(5,[1],[1.0])|
|  3|    Mona|       B.Com.|  21|Female|       1.0|(5,[1],[1.0])|
|  4|  Harish|        B.Sc.|  22|  Male|       2.0|(5,[2],[1.0])|
|  5|     Sam|           NA|  35|  Male|       3.0|(5,[3],[1.0])|
|  6|   Jonny|         B.A.|  22|  Male|       0.0|(5,[0],[1.0])|
|  7|   Maria|         B.A.|null|Female|       0.0|(5,[0],[1.0])|
|  8|      NA|         B.A.|  25|  Male|       0.0|(5,[0],[1.0])|
|  9|Monalisa|         B.A.|  21|Female|       0.0|(5,[0],[1.0])|
+---+--------+-------------+----+------+----------+-------------+
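A note on the size-5 vectors above (my reading of the docs, worth double-checking): StringIndexer with handleInvalid='keep' reserves an extra index for unseen labels, so the 4 observed qualifications (B.A., B.Com., B.Sc., NA) yield 5 category indices; OneHotEncoder with handleInvalid='keep' appends its own invalid bucket, and dropLast=True then removes the last slot. The arithmetic:

```python
n_labels = 4            # B.A., B.Com., B.Sc., NA (after na.fill)
indexer_keep_bucket = 1  # StringIndexer handleInvalid='keep' reserves an unseen-label index
ohe_keep_bucket = 1      # OneHotEncoder handleInvalid='keep' appends an invalid bucket
drop_last = 1            # dropLast=True removes the last slot

vector_size = n_labels + indexer_keep_bucket + ohe_keep_bucket - drop_last
print(vector_size)  # 5, matching (5,[0],[1.0]) etc. in the output above
```

With both 'keep' buckets and dropLast=True, an invalid value at transform time comes out as the all-zeros vector, just like a dropped last category.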
