Splitting website URLs into multiple columns in a Scala DataFrame

Problem description

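I have a column containing many URLs, like this:

[image: sample of the URL column]

I need to split this column on "." and get the following output:

[image: expected output, one column per URL segment]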

Tags: scala, apache-spark

Solution

See the code below.

You can ignore the "length" column; it is only used to determine the maximum number of columns.

scala> val df = Seq("www.google.co.kr","jun.artcompsci.org","mstdn.pssy.flab.fujitsu.cojp").toDF("URL")
df: org.apache.spark.sql.DataFrame = [URL: string]

scala> val adf = df.withColumn("url_array",split($"URL","\\.")).withColumn("length",size($"url_array")).orderBy($"length".desc)
adf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [URL: string, url_array: array<string> ... 1 more field]

scala> val length = adf.select("length").head.getInt(0)
length: Int = 5

scala> adf.select($"*" +: (0 until length).map(i => $"url_array".getItem(i).as(s"col$i")): _*).show(false)
+----------------------------+----------------------------------+------+-----+----------+----+-------+----+
|URL                         |url_array                         |length|col0 |col1      |col2|col3   |col4|
+----------------------------+----------------------------------+------+-----+----------+----+-------+----+
|mstdn.pssy.flab.fujitsu.cojp|[mstdn, pssy, flab, fujitsu, cojp]|5     |mstdn|pssy      |flab|fujitsu|cojp|
|www.google.co.kr            |[www, google, co, kr]             |4     |www  |google    |co  |kr     |null|
|jun.artcompsci.org          |[jun, artcompsci, org]            |3     |jun  |artcompsci|org |null   |null|
+----------------------------+----------------------------------+------+-----+----------+----+-------+----+
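The same idea can also be sketched as a standalone program rather than a spark-shell session. This is only a minimal sketch under some assumptions: the object name SplitUrlColumns, the helper splitColumns, and the local[*] master are hypothetical and chosen for illustration. It computes the maximum number of segments with an aggregation instead of sorting the whole DataFrame, and keeps only the original column plus the generated ones.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, size, split}

// Hypothetical helper object, not part of the original answer.
object SplitUrlColumns {

  // Splits `source` on literal dots and returns the source column
  // followed by col0, col1, ... (one column per segment).
  def splitColumns(df: DataFrame, source: String = "URL"): DataFrame = {
    // split takes a regular expression, so the dot must be escaped.
    val withParts = df.withColumn("url_array", split(col(source), "\\."))

    // The row with the most segments decides how many columns are needed.
    // Assumes the DataFrame is non-empty.
    val maxParts = withParts.agg(max(size(col("url_array")))).head.getInt(0)

    val partCols = (0 until maxParts).map { i =>
      col("url_array").getItem(i).as(s"col$i")
    }
    withParts.select(col(source) +: partCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SplitUrlColumns")
      .master("local[*]") // assumption: local run for demonstration
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      "www.google.co.kr",
      "jun.artcompsci.org",
      "mstdn.pssy.flab.fujitsu.cojp"
    ).toDF("URL")

    splitColumns(df).show(false)
    spark.stop()
  }
}
```

Note that rows with fewer segments rely on getItem returning null for an out-of-range index. Depending on the Spark version and the spark.sql.ansi.enabled setting, out-of-range indexing may raise an error instead, so it is worth checking that setting before relying on the null padding.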


