Removing bad data from a data file using Pig

Problem description

I have a data file like this:

1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82 
1976 66 3
1913 35 
1990 45 1
1927 92 A
1912  2
1924 22 
1971  2
1959 94 E

Now, using a Pig script, I want to remove the bad data, i.e. drop the rows that contain non-numeric characters or empty fields. I tried this:

records = load '/user/a106524609/test.txt' using PigStorage(' ') as 
(year:chararray, temperature:int, quality:int); 
rec1 = filter records by temperature != 'null' and (quality != 'null ')

Tags: hadoop, hdfs, apache-pig

Solution


Load each record as a line:

A = load 'data.txt' using PigStorage('\n') as (line:chararray);

Split on whitespace:

B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) AS (year:int, temp:int, quality:chararray);
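
To sanity-check the split before filtering, DESCRIBE and DUMP in the grunt shell show the declared schema and the flattened rows:

DESCRIBE B;
DUMP B;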

Filter by valid strings:

C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
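
If quality could ever be more than one digit, a regex keeps the same intent without enumerating values; MATCHES applies a Java regex to the whole field. An equivalent sketch:

C = FILTER B BY quality MATCHES '\\d+';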

(Optional) cast quality to int:

D = FOREACH C GENERATE year,temp,(int)quality;
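
As for the attempt in the question: temperature != 'null' compares an int field against the string literal 'null', which is not how Pig tests for missing values. Pig turns fields that fail a cast (such as quality 'L') or that are empty into nulls, so the typed load from the question can be filtered directly with IS NOT NULL. A minimal sketch of that approach:

records = LOAD '/user/a106524609/test.txt' USING PigStorage(' ')
          AS (year:chararray, temperature:int, quality:int);
rec1 = FILTER records BY temperature IS NOT NULL AND quality IS NOT NULL;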

In Spark, I would start with a regex match on the expected format:

val cleanRows = sc.textFile("data.txt")
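    // keep only lines that are exactly three whitespace-separated integer fields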
    .filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))
