pyspark - Confusion on types of Spark RDDs
Problem Description
I am just learning Spark; I started with RDDs and am now moving on to DataFrames. In my current PySpark project, I read an S3 file into an RDD and run some simple transformations on it. Here is the code.
segmentsRDD = sc.textFile(fileLocation) \
    .filter(lambda line: line.split(",")[6] in INCLUDE_SITES) \
    .filter(lambda line: line.split(",")[2] not in EXCLUDE_MARKETS) \
    .filter(lambda line: "null" not in line) \
    .map(splitComma) \
    .filter(lambda line: line.split(",")[5] == '1')
splitComma is a function that performs some date calculations on the row data and returns 10 comma-delimited fields. Once I have that, I run the last filter shown to pick up only rows where the value in field [5] is 1. So far everything is fine.
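(splitComma itself is not shown in the question; a hypothetical stand-in, just to make the pipeline self-contained. The field position of the timestamp and its format are assumptions, not from the original.)

```python
from datetime import datetime, timedelta

def splitComma(line):
    """Hypothetical sketch of the unshown function: parse the
    comma-delimited row, do a date calculation on one field, and
    return the row back as a single comma-joined string of 10 fields."""
    fields = line.split(",")
    # Assumed: field [9] holds a timestamp like "2020-01-01 00:00:00".
    start = datetime.strptime(fields[9], "%Y-%m-%d %H:%M:%S")
    fields[9] = (start + timedelta(days=1)).strftime("%Y-%m-%d %H:%M:%S")
    return ",".join(fields[:10])
```

Note that because it returns a single string, the later `filter` can split it again by comma, exactly as the pipeline above does.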
Next, I would like to convert segmentsRDD to a DataFrame with the schema shown below.
interim_segmentsDF = segmentsRDD.map(lambda x: x.split(",")).toDF("itemid","market","itemkey","start_offset","end_offset","time_shifted","day_shifted","tmsmarketid","caption","itemstarttime")
But I get an error saying a "pyspark.rdd.PipelinedRDD" cannot be converted to a DataFrame. Can you please explain the difference between a "pyspark.rdd.PipelinedRDD" and a "row RDD"? I am attempting to convert to a DataFrame with the schema shown. What am I missing here?
Thanks
Solution