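pyspark - VectorIndexer or OneHotEncoder for categorical variables?
Problem description
I'm a bit confused about when to use VectorIndexer versus OneHotEncoder when preparing categorical variables as input to ML algorithms in Spark. Is it the case that I need OneHotEncoder when I want to know the effect of each categorical level in the ML output, and that VectorIndexer is fine otherwise?
An example is shown below: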
from pyspark.ml.feature import OneHotEncoder, VectorAssembler , VectorIndexer
df = sqlContext.createDataFrame([
(0.0, 3.0, 3.8),
(1.0, 0.0, 6.7),
(2.0, 3.0, 3.3),
(0.0, 2.0, 1.2),
(0.0, 1.0, 7.8),
(2.0, 0.0, 4.4)
], ["category1", "category2","readings"])
encoder = OneHotEncoder(dropLast = True, inputCols=["category1", "category2"],
outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)
encoded = model.transform(df)
encoded.show()
+---------+---------+--------+-------------+-------------+
|category1|category2|readings| categoryVec1| categoryVec2|
+---------+---------+--------+-------------+-------------+
| 0.0| 3.0| 3.8|(2,[0],[1.0])| (3,[],[])|
| 1.0| 0.0| 6.7|(2,[1],[1.0])|(3,[0],[1.0])|
| 2.0| 3.0| 3.3| (2,[],[])| (3,[],[])|
| 0.0| 2.0| 1.2|(2,[0],[1.0])|(3,[2],[1.0])|
| 0.0| 1.0| 7.8|(2,[0],[1.0])|(3,[1],[1.0])|
| 2.0| 0.0| 4.4| (2,[],[])|(3,[0],[1.0])|
+---------+---------+--------+-------------+-------------+
va = VectorAssembler(inputCols = df.columns , outputCol = 'features')
assembled = va.transform(df)
idx = VectorIndexer(inputCol = 'features', outputCol = 'features_indexed', maxCategories = 4)
idx_model = idx.fit(assembled)
transformed = idx_model.transform(assembled)
transformed.show()
+---------+---------+--------+-------------+----------------+
|category1|category2|readings| features|features_indexed|
+---------+---------+--------+-------------+----------------+
| 0.0| 3.0| 3.8|[0.0,3.0,3.8]| [0.0,3.0,3.8]|
| 1.0| 0.0| 6.7|[1.0,0.0,6.7]| [1.0,0.0,6.7]|
| 2.0| 3.0| 3.3|[2.0,3.0,3.3]| [2.0,3.0,3.3]|
| 0.0| 2.0| 1.2|[0.0,2.0,1.2]| [0.0,2.0,1.2]|
| 0.0| 1.0| 7.8|[0.0,1.0,7.8]| [0.0,1.0,7.8]|
| 2.0| 0.0| 4.4|[2.0,0.0,4.4]| [2.0,0.0,4.4]|
+---------+---------+--------+-------------+----------------+
idx_model.categoryMaps
{0: {0.0: 0, 1.0: 1, 2.0: 2}, 1: {0.0: 0, 1.0: 1, 2.0: 2, 3.0: 3}}
Solution
To my understanding, OneHotEncoder applies only to numerical columns. If your categorical variable is StringType, then you need to pass it through StringIndexer first before you can apply OneHotEncoder.
StringIndexer transforms the labels into numeric indices, then OneHotEncoder turns each index into a coded vector.
The way Spark outputs the results of OneHotEncoder is unintuitive: by default (dropLast=True) the last category is dropped and encoded as an all-zeros vector. The docs say in the Notes section:
This is different from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.
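To make the sparse notation in the output concrete, here is a minimal pure-Python sketch (no Spark required) of how a category index maps to the (size, indices, values) triple printed by show(), with and without dropLast. The helpers one_hot_sparse and to_dense are hypothetical names used only for illustration; they mimic the documented behaviour and are not part of the pyspark API.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Mimic OneHotEncoder output as a (size, indices, values) triple."""
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    # With drop_last the vector has one slot fewer than there are categories.
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # The dropped last category is encoded as an all-zeros vector.
    return (size, [], [])


def to_dense(sparse):
    """Expand a (size, indices, values) triple into a plain list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# category1 in the question has 3 levels: 0.0, 1.0 and 2.0.
for level in (0, 1, 2):
    sparse = one_hot_sparse(level, num_categories=3)
    print(level, sparse, to_dense(sparse))
```

Under these assumptions, level 2 of category1 comes out as (2, [], []), which is the same all-zeros row that appears in the question's categoryVec1 column.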
If your categorical column is a Vector or an Array of Strings, then you would use VectorIndexer, then OneHotEncoder. Specifically, you can use VectorIndexer on your "features" column. Here's a similar question.
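As a rough illustration of what VectorIndexer does with the "features" column, the following pure-Python sketch reproduces the categoryMaps shown in the question. It is a simplified model, not Spark's implementation: the function name category_maps is hypothetical, and it assumes that a feature counts as categorical when it has at most maxCategories distinct values, with 0.0 (if present) mapped to index 0 and the remaining values indexed in ascending order.

```python
def category_maps(rows, max_categories):
    """Return {feature_index: {value: category_index}} for categorical features."""
    num_features = len(rows[0])
    maps = {}
    for j in range(num_features):
        distinct = {row[j] for row in rows}
        if len(distinct) > max_categories:
            continue  # too many distinct values: treated as continuous
        # Assumption: 0.0 gets index 0 if present, the rest follow in ascending order.
        ordered = sorted(v for v in distinct if v != 0.0)
        if 0.0 in distinct:
            ordered = [0.0] + ordered
        maps[j] = {value: i for i, value in enumerate(ordered)}
    return maps


# Same rows as the question: (category1, category2, readings).
rows = [
    (0.0, 3.0, 3.8),
    (1.0, 0.0, 6.7),
    (2.0, 3.0, 3.3),
    (0.0, 2.0, 1.2),
    (0.0, 1.0, 7.8),
    (2.0, 0.0, 4.4),
]
print(category_maps(rows, max_categories=4))
```

In this sketch readings has six distinct values, more than maxCategories = 4, so it is left out of the map and treated as continuous, while category1 and category2 are indexed. Because the category values in the question are already 0, 1, 2, ..., the indexed vector looks identical to the input.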
You need to fill in the nulls in your categorical columns first.
In PySpark, that's df.na.fill("value", subset=["col1","col2",...]).
In Scala, that's df.na.fill("value", Seq("col1","col2",...))
Here's a full example:
dummydata= [
(1,"John","B.A.",20,"Male"),
(2,"Martha","B.Com.",None,"Female"),
(3,"Mona","B.Com.",21,"Female"),
(4,"Harish","B.Sc.",22,"Male"),
(5,"Sam",None,35,"Male"),
(6,"Jonny","B.A.",22,"Male"),
(7,"Maria","B.A.",None,"Female"),
(8,None,"B.A.",25,"Male"),
(9,"Monalisa","B.A.",21,"Female")
]
toydf= spark.createDataFrame(data = dummydata, schema = ["id", "name", "qualification", "age", "gender"])
toydf.show()
+---+--------+-------------+----+------+
| id| name|qualification| age|gender|
+---+--------+-------------+----+------+
| 1| John| B.A.| 20| Male|
| 2| Martha| B.Com.|null|Female|
| 3| Mona| B.Com.| 21|Female|
| 4| Harish| B.Sc.| 22| Male|
| 5| Sam| null| 35| Male|
| 6| Jonny| B.A.| 22| Male|
| 7| Maria| B.A.|null|Female|
| 8| null| B.A.| 25| Male|
| 9|Monalisa| B.A.| 21|Female|
+---+--------+-------------+----+------+
# Replace nulls in the string columns with the placeholder label "NA"
toydf = toydf.na.fill("NA", subset=["name", "qualification"])
toydf.show()
+---+--------+-------------+----+------+
| id| name|qualification| age|gender|
+---+--------+-------------+----+------+
| 1| John| B.A.| 20| Male|
| 2| Martha| B.Com.|null|Female|
| 3| Mona| B.Com.| 21|Female|
| 4| Harish| B.Sc.| 22| Male|
| 5| Sam| NA| 35| Male|
| 6| Jonny| B.A.| 22| Male|
| 7| Maria| B.A.|null|Female|
| 8| NA| B.A.| 25| Male|
| 9|Monalisa| B.A.| 21|Female|
+---+--------+-------------+----+------+
from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer, VectorIndexer
indexer_1= StringIndexer(inputCols= ["qualification"], outputCols=["qual_index"], handleInvalid='keep', stringOrderType='frequencyDesc')
ohe_1= OneHotEncoder(inputCols=["qual_index"], outputCols=["qual_coded"], handleInvalid='keep',dropLast=True)
toydf= indexer_1.fit(toydf).transform(toydf)
toydf= ohe_1.fit(toydf).transform(toydf)
toydf.show()
+---+--------+-------------+----+------+----------+-------------+
| id| name|qualification| age|gender|qual_index| qual_coded|
+---+--------+-------------+----+------+----------+-------------+
| 1| John| B.A.| 20| Male| 0.0|(5,[0],[1.0])|
| 2| Martha| B.Com.|null|Female| 1.0|(5,[1],[1.0])|
| 3| Mona| B.Com.| 21|Female| 1.0|(5,[1],[1.0])|
| 4| Harish| B.Sc.| 22| Male| 2.0|(5,[2],[1.0])|
| 5| Sam| NA| 35| Male| 3.0|(5,[3],[1.0])|
| 6| Jonny| B.A.| 22| Male| 0.0|(5,[0],[1.0])|
| 7| Maria| B.A.|null|Female| 0.0|(5,[0],[1.0])|
| 8| NA| B.A.| 25| Male| 0.0|(5,[0],[1.0])|
| 9|Monalisa| B.A.| 21|Female| 0.0|(5,[0],[1.0])|
+---+--------+-------------+----+------+----------+-------------+
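The qual_index values above follow from stringOrderType='frequencyDesc': the most frequent label gets index 0. A minimal pure-Python sketch of that ordering, assuming ties in frequency are broken alphabetically (the helper name frequency_desc_index is hypothetical, not a pyspark function):

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; assumption: ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# The qualification column after nulls were filled with "NA".
qualifications = ["B.A.", "B.Com.", "B.Com.", "B.Sc.", "NA",
                  "B.A.", "B.A.", "B.A.", "B.A."]
print(frequency_desc_index(qualifications))
```

Under that assumption B.A. (five rows) gets 0.0, B.Com. (two rows) gets 1.0, and the two single-row labels B.Sc. and NA get 2.0 and 3.0, matching the qual_index column.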