python-3.x - Remove trailing whitespace from elements in a list
Problem Description

I have a Spark DataFrame in which a given column contains some text. I am trying to clean the text and split it on commas, which should output a new column containing a list of words.

The problem I am running into is that some of the elements in that list contain trailing whitespace that I want to remove.

Code:
# Libraries
# Standard Libraries
from typing import Dict, List, Tuple

# Third Party Libraries
import pyspark
from pyspark.ml.feature import Tokenizer
from pyspark.sql import SparkSession
import pyspark.sql.functions as s_function


def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col),
                                      r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Replace any run of characters that are not alphanumeric, except for
    # commas (,), since we still want to split on commas (,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col),
                                      "[^a-zA-Z0-9,]+", " "))
    # Split the affiliation string based on a comma
    sdf_temp = sdf_temp.withColumn(
        colName=output_col,
        col=s_function.split(sdf_temp[input_col], ", "))
    return sdf_temp
if __name__ == "__main__":
    # Sample data
    a_1 = "Department of Bone and Joint Surgery, Ehime University Graduate"\
          " School of Medicine, Shitsukawa, Toon 791-0295, Ehime, Japan."\
          " shinyama@m.ehime-u.ac.jp."
    a_2 = "Stroke Pharmacogenomics and Genetics, Fundació Docència i Recerca"\
          " Mútua Terrassa, Hospital Mútua de Terrassa, 08221 Terrassa, Spain."
    a_3 = "Neurovascular Research Laboratory, Vall d'Hebron Institute of Research,"\
          " Hospital Vall d'Hebron, 08035 Barcelona, Spain;catycarrerav@gmail.com"\
          " (C.C.). catycarrerav@gmail.com."
    data = [(1, a_1), (2, a_2), (3, a_3)]

    spark = SparkSession\
        .builder\
        .master("local[*]")\
        .appName("My_test")\
        .config("spark.ui.port", "37822")\
        .getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    af_data = spark.createDataFrame(data, ["index", "text"])
    sdf_tokens = tokenize(af_data)
    sdf_tokens.select("tokens").show(truncate=False)
Output:
|[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon , Ehime, Japan ] |
|[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain ] |
|[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C ]
Desired output:
|[Department of Bone and Joint Surgery, Ehime University Graduate School of Medicine, Shitsukawa, Toon, Ehime, Japan] |
|[Stroke Pharmacogenomics and Genetics, Fundaci Doc ncia i Recerca M tua Terrassa, Hospital M tua de Terrassa, Terrassa, Spain] |
|[Neurovascular Research Laboratory, Vall d Hebron Institute of Research, Hospital Vall d Hebron, Barcelona, Spain C C]
So, in:

- Row 1: 'Toon ' -> 'Toon', 'Japan ' -> 'Japan'.
- Row 2: 'Spain ' -> 'Spain'.
- Row 3: 'Spain C C ' -> 'Spain C C'.
Note

The trailing whitespace does not only appear in the last element of a list; it can appear in any element.
Solution

Update

The original solution did not work, because trim only operates on the beginning and end of the whole string, whereas you need it applied to each individual token. @PatrickArtner's solution works, but another approach is to use RegexTokenizer.
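As a further aside, if you would rather keep the split()-based column and trim every element after the fact, the SQL higher-order function transform can apply trim per element. A minimal sketch, assuming Spark 2.4+ and that the array column on a DataFrame sdf_tokens is named tokens (both names are just for illustration):

import pyspark.sql.functions as s_function

# Assumes Spark 2.4+: trim each element of the tokens array using the
# SQL higher-order function transform.
sdf_trimmed = sdf_tokens.withColumn(
    "tokens",
    s_function.expr("transform(tokens, x -> trim(x))"))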
Here is an example of how the tokenize() function could be modified:
from pyspark.ml.feature import RegexTokenizer

def tokenize(sdf, input_col="text", output_col="tokens"):
    # Remove email addresses
    sdf_temp = sdf.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col),
                                      r"[\w\.-]+@[\w\.-]+\.\w+", ""))
    # Remove digits
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col), r"\d", ""))
    # Replace any run of characters that are not alphanumeric, except for
    # commas (,), since we still want to split on commas (,)
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.regexp_replace(s_function.col(input_col),
                                      "[^a-zA-Z0-9,]+", " "))
    # Call trim to remove any trailing (or leading) spaces
    sdf_temp = sdf_temp.withColumn(
        colName=input_col,
        col=s_function.trim(sdf_temp[input_col]))
    # Use RegexTokenizer to split on commas optionally surrounded by whitespace
    myTokenizer = RegexTokenizer(
        inputCol=input_col,
        outputCol=output_col,
        pattern="( +)?, ?")
    sdf_temp = myTokenizer.transform(sdf_temp)
    return sdf_temp
Essentially, you call trim on your string to handle any leading or trailing spaces, and then use RegexTokenizer to split on the pattern "( +)?, ?":

- "( +)?": matches between zero and unlimited spaces before the comma
- ",": matches the comma exactly
- " ?": matches one optional space after the comma
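To sanity-check that pattern outside of Spark, here is a minimal standalone sketch using Python's re module (with the group written as non-capturing, (?: +)?, so that re.split returns only the tokens rather than also the captured spaces):

import re

# Same split logic as the RegexTokenizer pattern above; the
# non-capturing group keeps re.split from returning the spaces.
print(re.split(r"(?: +)?, ?", "Toon , Ehime, Japan"))
# ['Toon', 'Ehime', 'Japan']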
Here is the output:

sdf_tokens.select("tokens", s_function.size("tokens").alias("size")).show(truncate=False)

You can see that the length of each array (the number of tokens) is correct, but all of the tokens are lowercase (because that is what Tokenizer and RegexTokenizer do).
+------------------------------------------------------------------------------------------------------------------------------+----+
|tokens |size|
+------------------------------------------------------------------------------------------------------------------------------+----+
|[department of bone and joint surgery, ehime university graduate school of medicine, shitsukawa, toon, ehime, japan] |6 |
|[stroke pharmacogenomics and genetics, fundaci doc ncia i recerca m tua terrassa, hospital m tua de terrassa, terrassa, spain]|5 |
|[neurovascular research laboratory, vall d hebron institute of research, hospital vall d hebron, barcelona, spain c c] |5 |
+------------------------------------------------------------------------------------------------------------------------------+----+
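If you need to keep the original casing, RegexTokenizer also accepts a toLowercase parameter (True by default), so the tokenizer inside tokenize() could, for example, be constructed as:

myTokenizer = RegexTokenizer(
    inputCol=input_col,
    outputCol=output_col,
    pattern="( +)?, ?",
    toLowercase=False)  # keep the original casing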
Original Answer

As long as you are using Spark version 1.5 or higher, you can use pyspark.sql.functions.trim(), which will:

    Trim the spaces from both ends for the specified string column.

So one option is to add:
sdf_temp = sdf_temp.withColumn(
    colName=input_col,
    col=s_function.trim(sdf_temp[input_col]))
at the end of your tokenize() function.

However, you may also want to look at pyspark.ml.feature.Tokenizer or pyspark.ml.feature.RegexTokenizer. One idea could be to use your function to clean the strings and then use Tokenizer to create the tokens. (I see that you have already imported it, but it does not seem to be used.)
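A minimal sketch of that idea, assuming the cleaning function has already produced a whitespace-separated text column on af_data from the question (note that Tokenizer itself simply lowercases the input and splits on whitespace, so any comma handling still has to happen in the cleaning step):

from pyspark.ml.feature import Tokenizer

# Tokenizer lowercases the input and splits it on whitespace.
myTokenizer = Tokenizer(inputCol="text", outputCol="tokens")
sdf_tokens = myTokenizer.transform(af_data)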