I have unwanted data in some of my columns. How do I get rid of it?

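Problem description

As you can see in the age and gender columns below, some cells contain data where the value should be either null or a number. Why do the cells spill into each other, and how can I clean up my columns?

As I understand it, the root of the problem is the description column. Some cells appear empty, or the data shown contains stray whitespace that was not removed, even though the cells do hold data. As a result, when I read the file, the contents of description show up in the age and gender columns.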

df = sqlContext.read.csv("/FileStore/tables/mtmedical_V6-16623.csv", header=True)
df.show(150)

Output:

+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
|         description|   medical_specialty|                 age|              gender|sample_name (What has been done to patient = Treatment)|       transcription|            keywords|
+--------------------+--------------------+--------------------+--------------------+-------------------------------------------------------+--------------------+--------------------+
| A 23-year-old wh...| Allergy / Immuno...|                  23|              female|                                     Allergic Rhinitis |SUBJECTIVE:,  Thi...|allergy / immunol...|
| Consult for lapa...|          Bariatrics|                null|                male|                                    Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...|
| Consult for lapa...|          Bariatrics|                  42|                male|                                    Laparoscopic Gas...|"HISTORY OF PRESE...| at his highest h...|
| 2-D M-Mode. Dopp...| Cardiovascular /...|                null|                null|                                    2-D Echocardiogr...|2-D M-MODE: , ,1....|cardiovascular / ...|
|  2-D Echocardiogram| Cardiovascular /...|                null|                male|                                    2-D Echocardiogr...|1.  The left vent...|cardiovascular / ...|
| Morbid obesity. ...|          Bariatrics|                  30|                male|                                    Laparoscopic Gas...|PREOPERATIVE DIAG...|bariatrics, gastr...|
| Liposuction of t...|                null|                null|                null|                                                   null|                null|                null|
|", Bariatrics,31,...|       1.  Deformity| right breast rec...|2.  Excess soft t...|                                    anterior abdomen...|3.  Lipodystrophy...|POSTOPERATIVE DIA...|
|  2-D Echocardiogram| Cardiovascular /...|                null|                male|                                    2-D Echocardiogr...|2-D ECHOCARDIOGRA...|cardiovascular / ...|
| Suction-assisted...|          Bariatrics|                null|                male|                                    Lipectomy - Abdo...|PREOPERATIVE DIAG...|bariatrics, lipod...|
| Echocardiogram a...| Cardiovascular /...|                null|                null|                                    2-D Echocardiogr...|DESCRIPTION:,1.  ...|cardiovascular / ...|
| Morbid obesity. ...|          Bariatrics|                  50|                male|                                    Laparoscopic Gas...|PREOPERATIVE DIAG...|bariatrics, morbi...|
| Normal left vent...| Cardiovascular /...|                null|                male|                                           2-D Doppler |2-D STUDY,1. Mild...|cardiovascular / ...|
| Cerebral Angiogr...|           Neurology|                  31|                male|                                      Moyamoya Disease |"CC:, Confusion a...| she was found ""...|

This is what the CSV file looks like:

Tags: python, apache-spark, pyspark, apache-spark-sql, data-science

Solution


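One option is to map over the DataFrame and drop the "bad rows". However, if you are going to ingest several CSV files like this one, that will not be a very scalable process.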
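A minimal sketch of that first option, written in plain Python so it runs without a cluster. The helper names `is_blank_or`, `is_clean`, and the allowed gender labels are assumptions made for illustration, and the sample values are copied from the truncated output in the question. In PySpark the same check would be expressed as a filter on the DataFrame.

```python
def is_blank_or(value, predicate):
    """True when the value is missing/empty, or when it satisfies the predicate."""
    if value is None or value.strip() == "":
        return True
    return predicate(value.strip())


def is_clean(row):
    """A row is clean when age is missing or an integer and gender is missing or a known label."""
    # Assumption: "male" and "female" are the only labels, as seen in the question's output.
    age_ok = is_blank_or(row.get("age"), str.isdigit)
    gender_ok = is_blank_or(row.get("gender"), lambda g: g.lower() in {"male", "female"})
    return age_ok and gender_ok


# Three rows taken from the (truncated) output shown in the question.
rows = [
    {"description": "A 23-year-old wh...", "age": "23", "gender": "female"},
    {"description": "Consult for lapa...", "age": None, "gender": "male"},
    {"description": '", Bariatrics,31,...', "age": " right breast rec...", "gender": "2.  Excess soft t..."},
]

clean_rows = [row for row in rows if is_clean(row)]
```

Note that this only hides the symptom. In the output above, the dropped row looks like the tail of the preceding "Liposuction" record, so filtering it out discards part of a real record rather than repairing it.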

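A second option is to clean the CSV file itself. It looks to me as if the file has incorrect tabs or spaces, which may be what causes the problem.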
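One way to clean the file could be to re-write it so that every record sits on a single physical line. The sketch below uses Python's standard `csv` module and assumes the file quotes its multi-line fields in the usual CSV way. The function name `flatten_embedded_newlines` and the sample text are made up for illustration and only loosely modeled on the question's data.

```python
import csv
import io


def flatten_embedded_newlines(csv_text):
    """Re-write CSV text so that each record occupies exactly one physical line."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for record in reader:
        # Replace line breaks inside a field with single spaces.
        writer.writerow([" ".join(field.splitlines()) for field in record])
    return out.getvalue()


# Hypothetical sample: two quoted fields contain line breaks.
raw = (
    "description,age,gender,transcription\n"
    '"Liposuction of the\nabdomen",31,male,"1.  Deformity,\n2.  Excess soft tissue"\n'
)

cleaned = flatten_embedded_newlines(raw)
records = list(csv.reader(io.StringIO(cleaned, newline="")))
```

After this pass the sample has one header line and one data line, so a reader that splits on line breaks sees the same records as a quote-aware one.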

Finally, you can try the following (shown in Scala; the same reader options exist in PySpark):

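// Scala: read the CSV so that quoted fields spanning several lines stay in one record
val df = spark.read
  .option("wholeFile", true)      // older spelling of the multiLine option, kept from the original answer
  .option("multiLine", true)      // allow a quoted field to contain line breaks
  .option("header", true)         // the first line holds the column names
  .option("inferSchema", "true")  // infer column types, for example an integer age
  .csv("/FileStore/tables/mtmedical_V6-16623.csv")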

This should handle text content that contains multiple line breaks, which is probably the problem here.
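The reader call above is Scala, while the question uses PySpark. A rough PySpark equivalent might look like the sketch below. The `escape` setting is an assumption on my part, not part of the original answer: the output in the question shows doubled quotes (`""`) inside the text, and Spark's default escape character is a backslash, so setting it to `"` may be needed for those fields to parse as one value. The names `CSV_OPTIONS` and `read_medical_csv` are hypothetical.

```python
# Reader options for a CSV whose quoted fields may span several lines.
CSV_OPTIONS = {
    "header": True,       # the first line holds the column names
    "multiLine": True,    # allow a quoted field to contain line breaks
    "inferSchema": True,  # infer column types, for example an integer age
    "quote": '"',         # fields are wrapped in double quotes (Spark's default)
    "escape": '"',        # assumption: quotes inside a field are doubled ("")
}


def read_medical_csv(spark, path):
    """Read the CSV with the options above; `spark` is an existing SparkSession."""
    return spark.read.options(**CSV_OPTIONS).csv(path)
```

Usage in the question's environment would be along the lines of `df = read_medical_csv(spark, "/FileStore/tables/mtmedical_V6-16623.csv")`, followed by `df.show(150)` to check whether the age and gender columns now hold only nulls, numbers, and gender labels.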

