
Problem Description

I need to filter a column to keep only the text that starts with `>`. I know the String functions startsWith & contains, but I need to apply this to a column in a DataFrame.

 val dataSet = spark.read.option("header","true").option("inferschema","true").json(input).cache()
 dataSet.select(col = "_source.content").filter(_.startsWith(">"))

startsWith does not work on the Dataset.

Tags: scala, parsing, apache-spark, apache-spark-sql, bigdata

Solution


Yes, it does, for example:

import org.apache.spark.sql.functions.col // provides col(...)
import spark.implicits._ // provides .toDF on local collections (pre-imported in spark-shell)

val df = List(
  ("1001", "[physics, chemistry]", "pass"),
  ("1001", "[biology, math]", "fail"),
  ("3002", "[economics]", "pass"),
  ("2002", "[physics, chemistry]", "fail")
).toDF("student_id", "subjects", "result")

df.filter(col("student_id").startsWith("3")).show

Returns:

+----------+-----------+------+
|student_id|   subjects|result|
+----------+-----------+------+
|      3002|[economics]|  pass|
+----------+-----------+------+
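The asker's original `.filter(_.startsWith(">"))` fails because an untyped DataFrame yields `Row` objects, not `String`s; a lambda like that only works on a typed `Dataset[SomeCaseClass]`. A minimal pure-Scala sketch of that row-wise predicate (the `Content` case class and the sample strings are made up for illustration; no Spark needed):

```scala
// Models what a typed Dataset filter, ds.filter(_.content.startsWith(">")),
// evaluates per row; plain Scala, no Spark required.
case class Content(content: String)

val rows = List(
  Content("> quoted reply"),
  Content("plain text"),
  Content(">>> nested quote")
)

// Keep only rows whose text starts with ">"
val kept = rows.filter(_.content.startsWith(">"))

kept.foreach(r => println(r.content))
```

On a real DataFrame the equivalent is the `Column` predicate shown above, e.g. `df.filter(col("content").startsWith(">"))`.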

For JSON-derived input - albeit not really relevant here - an example using a DataFrame rather than a Dataset (it also works on a Dataset); fields inside a struct differ only slightly:

import spark.implicits._ // for the $"..." column syntax (pre-imported in spark-shell)
val df = spark.read.json("/FileStore/tables/json_nested_4.txt")

import org.apache.spark.sql.functions._
val flattened = df.select($"name", explode($"schools").as("schools_flat"))

flattened.filter(col("name").startsWith("J")).show
flattened.filter(col("schools_flat.sname").startsWith("u")).show

Basic input and structure:

+-------+----------------+
|   name|    schools_flat|
+-------+----------------+
|Michael|[stanford, 2010]|
|Michael|[berkeley, 2012]|
|   Andy|    [ucsb, 2011]|
| Justin|[berkeley, 2014]|
+-------+----------------+

flattened: org.apache.spark.sql.DataFrame = [name: string, schools_flat: struct<sname: string, year: bigint>]

Returns:

+------+----------------+
|  name|    schools_flat|
+------+----------------+
|Justin|[berkeley, 2014]|
+------+----------------+

+----+------------+
|name|schools_flat|
+----+------------+
|Andy|[ucsb, 2011]|
+----+------------+
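For completeness, the same prefix test can also be phrased with SQL-style pattern matching on a `Column`, e.g. `col("name").like("J%")` or `col("name").rlike("^J")` in place of `startsWith("J")`. A plain-Scala sketch of the regex form (the sample list is made up; the anchored regex mirrors what the `rlike("^J")` expression checks per row):

```scala
import scala.util.matching.Regex

// Regex equivalent of startsWith("J"): a match anchored at the start of the string
val prefix: Regex = "^J".r

val names = List("Justin", "Michael", "Andy", "Julia")

// findPrefixOf succeeds only when the pattern matches at the beginning of the input
val matched = names.filter(n => prefix.findPrefixOf(n).isDefined)
```

`startsWith` is the simplest choice for a literal prefix; the `like`/`rlike` forms are useful when the pattern is more than a fixed prefix.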
