Printing non-ASCII column values (python-spark)

Problem description

I'm very new to Python and Spark. I wrote a UDF to remove the non-ASCII characters present in a string.

What is the most efficient way to print the offending values while performing the operation? (The offending values are the cells that contain non-ASCII characters.)

Code:

import findspark
findspark.init()
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd
sc = spark.sparkContext

from pyspark.sql.window import Window
from pyspark.sql.functions import count, col
from pyspark.sql import Row
from pyspark.sql.functions import udf
def nonasciitoascii(unicodestring):
    # in Python 3, encode() returns bytes, so decode back to str
    return unicodestring.encode("ascii", "ignore").decode("ascii")

df=spark.read.csv("abc.csv")
df.show()

df.printSchema()

convertedudf = udf(nonasciitoascii)
converted = df.select('_c1', '_c2').withColumn('converted', convertedudf(df._c1))
converted.show()
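Note that in Python 3, `str.encode()` returns `bytes`; decoding back to `str` keeps the column type consistent. A plain-Python sketch of the conversion, outside Spark:

```python
def nonasciitoascii(unicodestring):
    # drop non-ASCII characters, then decode the bytes back to str
    return unicodestring.encode("ascii", "ignore").decode("ascii")

print(nonasciitoascii("héllo wörld"))  # → hllo wrld
```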

Tags: python, pyspark, ascii

Solution


A simple solution that works in most cases is to run a separate computation for this purpose:

# Python 3.7+: str.isascii() is built in
def check_ascii(string):
    if not string.isascii():
        return string
    else:
        return None

def check_ascii_in_python_2(string):
    if not all(ord(char) < 128 for char in string):
        return string
    else:
        return None

# a plain Python function must be wrapped in a udf before it can be
# applied to a DataFrame column
check_ascii_udf = udf(check_ascii)
all_strings_with_non_ascii_chars = df.select('_c1', '_c2') \
    .withColumn('check', check_ascii_udf(df._c1)) \
    .filter('check is not null') \
    .select('check')
all_strings_with_non_ascii_chars.show()
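If you only need to surface the offending rows, Spark's built-in `Column.rlike` can do the filtering with a regex instead of a Python UDF, avoiding the serialization overhead of shipping each value to Python. A sketch, assuming the same `df` and `_c1` column as above; the character class matches anything outside the 7-bit ASCII range:

```python
import re

# the same character class Spark's rlike would use:
# anything outside the 7-bit ASCII range
NON_ASCII = re.compile('[^\x00-\x7F]')

# equivalent Spark filter (assuming df and _c1 from above):
# non_ascii_rows = df.filter(df._c1.rlike('[^\x00-\x7F]'))
# non_ascii_rows.show()

# demonstrate the pattern on plain strings:
print(bool(NON_ASCII.search("héllo")))  # True: contains non-ASCII
print(bool(NON_ASCII.search("hello")))  # False: pure ASCII
```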
