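apache-spark - PySpark: split a DataFrame string column into multiple columns
Problem description
I am working through a Spark Structured Streaming example on Spark 3.0.0, using Twitter data. I have pushed the Twitter data into Kafka, and a single record looks like this: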
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
Each field is separated by '|'. The fields are:
Date time
User ID
Tweet
Location
When I read this message in Spark, I get a DataFrame like this:
key | value
-----+-------------------------
| 2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
Following this answer, I added the following code block to my application.
split_col = pyspark.sql.functions.split(df['value'], '|')
df = df.withColumn("Tweet Time", split_col.getItem(0))
df = df.withColumn("User ID", split_col.getItem(1))
df = df.withColumn("Tweet Text", split_col.getItem(2))
df = df.withColumn("Location", split_col.getItem(3))
df = df.drop("key")
But it gives me output like this:
A | B | C | D | E |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------+--------+-----+
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2 | 0 | 2 | 0 |
But I want output like this:
Tweet Time | User ID | Tweet text | Location |
-----------------------+-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
2020-07-21 10:48:19 | 1265200268284588034 | RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,… | Hyderabad, India |
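Solution
This happens because split takes a pattern: a string representing a regular expression, and that regex string should be a Java regular expression. An unescaped | is the regex alternation operator, not a literal pipe character.
Use "\\|" to split on the pipe, or '[|]'.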
from pyspark.sql.functions import split

# Escape the pipe so the regex engine treats it as a literal character
split_col = split(df.value, '\\|')
df = df.withColumn("Tweet Time", split_col.getItem(0))\
    .withColumn("User ID", split_col.getItem(1))\
    .withColumn("Tweet Text", split_col.getItem(2))\
    .withColumn("Location", split_col.getItem(3))\
    .drop("key")
Output:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|value |Tweet Time |User ID |Tweet Text |Location |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
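To see why the unescaped pattern yields single characters such as 2, 0, 2, 0, the effect can be reproduced without a Spark session. The following is a minimal sketch using Python's built-in re module (Python 3.7 or later) as a stand-in. It is not the Java regex engine that Spark uses, and the two differ in how they treat empty strings at the edges, but both read an unescaped | as alternation between two empty patterns. The record below is a shortened, hypothetical version of the sample row.

```python
import re

# Shortened, hypothetical version of the sample record from the question.
record = ("2020-07-21 10:48:19|1265200268284588034|"
          "RT @narendramodi: Had an extensive interaction|Hyderabad, India")

# Unescaped "|" means "empty pattern OR empty pattern", which matches the
# empty string at every position, so the split happens between all characters.
# Empty pieces are dropped here to keep the comparison simple.
unescaped = [piece for piece in re.split("|", record) if piece]

# Escaped pipe and single-character class both match a literal "|".
escaped = re.split(r"\|", record)
char_class = re.split("[|]", record)

print(unescaped[:4])   # first four single characters of the timestamp
print(escaped)         # the four intended fields
print(re.escape("|"))  # one way to derive the escaped pattern from a delimiter
```

The first four pieces of the unescaped split are the leading characters of the timestamp, which matches the stray values seen in the question's output, while both escaped forms return the four intended fields.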