
TypeError: element in array field Category: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

Problem Description

I am reading a CSV file using Pandas into a two-column dataframe, and I am then trying to convert it to a Spark dataframe. The code for this is:

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)            # sc: the existing SparkContext
sdf = sqlCtx.createDataFrame(df)   # this call raises the TypeError above

The dataframe:

print(df) 

gives this:

    Name    Category
0   EDSJOBLIST apply at www.edsjoblist.com  ['biotechnology', 'clinical', 'diagnostic', 'd...
1   Power Direct Marketing  ['advertising', 'analytics', 'brand positionin...
2   CHA Hollywood Medical Center, L.P.  ['general medical and surgical hospital', 'hea...
3   JING JING GOURMET   [nan]
4   TRUE LIFE KINGDOM MINISTRIES    ['religious organization']
5   fasterproms ['microsoft .net']
6   STEREO ZONE ['accessory', 'audio', 'car audio', 'chrome', ...
7   SAN FRANCISCO NEUROLOGICAL SOCIETY  [nan]
8   Fl Advisors ['comprehensive financial planning', 'financia...
9   Fortunatus LLC  ['bottle', 'bottling', 'charitable', 'dna', 'f...
10  TREADS LLC  ['retail', 'wholesaling']

Can anyone help me with this?

Tags: python, pandas, dataframe, pyspark, apache-spark-sql

Solution

Spark can have difficulty inferring a schema for pandas object columns. Here the Category column mixes lists of strings with plain float nan values, so Spark cannot merge the element types it samples, which is exactly what the TypeError is complaining about. A potential workaround is to convert everything to a string first:

sdf = sqlCtx.createDataFrame(df.astype(str))  # every value becomes a Python str
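A quick sanity check (not part of the original answer) is to print the schema after the conversion; both columns should now come through as plain strings:

sdf.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Category: string (nullable = true)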

One consequence of this is that everything, including nan, will be converted to a string. You will need to take care to handle these conversions properly and cast each column back to its appropriate type.

For instance, if you had a column "colA" with floating point values, you could use something like the following to cast the values back to float while turning the string "nan" into a null:

from pyspark.sql.functions import col, when

# Values other than "nan" are cast back to float; rows where colA == "nan"
# fall through the when() with no otherwise() and therefore become null.
sdf = sdf.withColumn("colA", when(col("colA") != "nan", col("colA").cast("float")))
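In this question's dataframe, the Category column will likewise hold the string representation of a Python list after astype(str), for example "['retail', 'wholesaling']". Below is a minimal sketch, not from the original answer, of recovering an array of strings from it, assuming the values keep exactly that repr format:

from pyspark.sql.functions import col, when, regexp_replace, split

# Strip the brackets and quotes from the stringified list, then split on the
# comma separator; "[nan]" entries fall through the when() and become null.
sdf = sdf.withColumn(
    "Category",
    when(
        col("Category") != "[nan]",
        split(regexp_replace(col("Category"), r"[\[\]']", ""), r",\s*"),
    ),
)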
