pyspark - 如何通过 Pyspark 中同一数据框中另一列的正则表达式值过滤数据框中的一列
问题描述
我正在尝试过滤与另一列中给出的正则表达式模式匹配的数据框中的列
df = sqlContext.createDataFrame([('what is the movie that features Tom Cruise','actor_movies','(movie|film).*(feature)|(in|on).*(movie|film)'),
('what is the movie that features Tom Cruise','artist_song','(who|what).*(sing|sang|perform)'),
('who is the singer for hotel califonia?','artist_song','(who|what).*(sing|sang|perform)')],
['query','question_type','regex_patt'])
+--------------------+-------------+----------------------------------------- -+
| query |question_type |regex_patt|
+--------------------+-------------+----------------------------------------------+
|what movie features Tom Cruise | actor_movies | (movie|film).*(feature)|(in|on).*(movie|film)
|what movie features Tom Cruise | artist_song | (who|what).*(sing|sang|perform)
|who is the singer for hotel califonia | artist_song | (who|what).*(sing|sang|perform) |
+--------------------+-------------+------------------------------------------------+
我想修剪数据框,以便只保留查询与 regex_pattern 列值匹配的行。
最终结果应该是这样的
+--------------------+-------------+----------------------------------------- -+
| query |question_type |regex_patt|
+--------------------+-------------+----------------------------------------------+
|what movie features Tom Cruise | actor_movies | (movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia | artist_song | (who|what).*(sing|sang|perform)
+--------------------+-------------+------------------------------------------------+
我在想
df.filter(column('query').rlike('regex_patt'))
但 rlike 只接受正则表达式字符串。
现在的问题是,如何根据"query"
列的正则表达式值过滤"regex_patt"
列?
解决方案
你可以试试这个。该表达式允许您将列作为 str 和模式。
from pyspark.sql import functions as F
df.withColumn("query1", F.expr("""regexp_extract(query, regex_patt)""")).filter(F.col("query1")!='').drop("query1").show(truncate=False)
+------------------------------------------+-------------+---------------------------------------------+
|query |question_type|regex_patt |
+------------------------------------------+-------------+---------------------------------------------+
|what is the movie that features Tom Cruise|actor_movies |(movie|film).*(feature)|(in|on).*(movie|film)|
|who is the singer for hotel califonia? |artist_song |(who|what).*(sing|sang|perform) |
+------------------------------------------+-------------+---------------------------------------------+
推荐阅读
- vb.net - 使用 Visual Basic 在 .txt 文件中搜索文本
- gradle - Kotlin DSL:将 versions.gradle.kts 导入另一个 build.gradle.kts
- javascript - 沿最佳圆周分布 n 组圆
- javascript - 基于全路径的层次节点路径生成
- php - 创建新的 Laravel 后 Laravel 迁移卡住了
- python - 在 python 中将 Callable 输入到特定方法
- angular - 如何路由到不同模块中的组件?
- php - 在 Yii2 中获取子模块的所有父模块列表,反之亦然
- tomcat - Tomcat 一段时间后不提供 Web 应用程序
- ssh - 通过 Certbot 安装 SSL 证书后无法连接到 SSH