How to extract a simple string from text in PySpark using natural language processing

Problem description

I have a PySpark DataFrame with 4 columns. One column contains text (the data is unstructured). Below is a sample of that column's data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows; each one-element tuple is a row of the single "text" column.
data = [('Ambitioni dedisse scripsisse iudicaretur',)
,('Cras mattisiudicium',)
,('purus sit amet fermentum',)
,('Donec sed odio operae- NORMAL',)
,('eu vulputate felis - A300B4-61 - MP 13219',)
,('Praeterea iter est - quasdam res - MP 28180',)
,('quas ex communi - ',)
,('At nos hinc posthat CONTROL - FADEC',)
,('sitientis piros Afros. Petierunt',)
,('uti sibi concilium totius Galliae-2 - GENERATION',)
,('in dim - V105X ',)
,('Cras mattis iudicium',)]

df = spark.createDataFrame(data, ["text"])

Expected output example:

    Interest column (example data)                  | new_column
    ------------------------------------------------|-----------
    Cras mattis iudicium -INTRODCE A NEW STANDARD   |
    Praeterea iter est                              |
    Cras mattis iudicium purus sit amet fermentum.  |
    class to truncate the text                      |
    Ambitioni dedisse -                             |
    For left, right,                                |
    TCAS II - Praeterea iter est                    |
    Donec sed odio operae                           |
    Ambitioni dedisse                               |


My question: 
Thank you



Tags: apache-spark, pyspark, nlp

Solution
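
The question does not state which string should end up in new_column, so what follows is only a minimal sketch under an assumption: that the string of interest is the segment after the last hyphen followed by whitespace (for example "MP 13219" or "FADEC" in the sample rows). The function name `extract_trailing_segment` and the pattern are hypothetical, chosen for illustration, and use a plain regular expression rather than an NLP model.

```python
import re

# Assumption (not confirmed by the question): the wanted string is whatever
# follows the last hyphen that is followed by whitespace. Hyphens inside
# codes such as "A300B4-61" are not treated as separators.
TRAILING_SEGMENT = re.compile(r"-\s+([^-]+?)\s*$")


def extract_trailing_segment(text):
    """Return the text after the last '- ' separator, or None if there is none."""
    match = TRAILING_SEGMENT.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        "eu vulputate felis - A300B4-61 - MP 13219",
        "At nos hinc posthat CONTROL - FADEC",
        "quas ex communi - ",
        "Cras mattis iudicium",
    ]
    for sample in samples:
        print(repr(sample), "->", repr(extract_trailing_segment(sample)))
```

If this assumption matches the intended output, the same pattern could probably be applied to the DataFrame column directly with `pyspark.sql.functions.regexp_extract(df["text"], r"-\s+([^-]+?)\s*$", 1)`; note that `regexp_extract` yields an empty string, not null, for rows where the pattern does not match.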

