Filter by operation not working in Pig, not sure what is going on?

Problem description

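I'm stuck trying to use Pig to extract tweets from a specific location, defined by a lat-long bounding box.

I've run the script below, and it works up until I filter on lat/long, at which point it dies.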

My script

REGISTER 'hdfs/json-simple-1.1.jar';
REGISTER 'hdfs/elephant-bird-hadoop-compat-4.1.jar';
REGISTER 'hdfs/elephant-bird-pig-4.1.jar';

-- this is just one day, there is a bunch more data, once the script is working well
-- /data/ProjectDataset/statuses.log.2014-12-31.gz
tweets_all = LOAD '/data/ProjectDataset/statuses.log.2014-12-3*' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);

-- JUST THE COORDINATES
-- to get the geo locations of tweets
tweets_all = FOREACH tweets_all GENERATE FLATTEN(json#'created_at') as time_stamp:chararray, FLATTEN(json#'id') as id:chararray, FLATTEN(json#'coordinates') as (coords_map:map[]);

-- remove duplicates
tweets = DISTINCT tweets_all;

-- filter for tweets with geo tags
filtered = FILTER tweets BY (coords_map IS NOT NULL);

-- parse the date time and unpack the geo data
locs1 = foreach filtered generate ToDate(time_stamp, 'EEE MMM dd HH:mm:ss Z yyyy') as time_stamp, coords_map#'coordinates' as coordinates:bag{t1:tuple(f1:double, f2:double)}, id as id;

-- reference longitude and latitude
locs2 = foreach locs1 generate BagToTuple(coordinates).$0 as longitude:double, BagToTuple(coordinates).$1 as latitude:double, id, time_stamp;

-- filter for tweets with geo tags with longs between (-70.0 and -80.0) and lats between (35.0 and 45.0)
geo_filtered = FILTER locs2 BY (longitude > 35) and (longitude < 45) and (latitude > -80) and (latitude < -70);

-- look at the top results
tops = limit geo_filtered 10;
dump tops;

It works for locs2: running tops = limit locs2 5; dump tops; returns:

(-81.9536, 34.9307, 549701401182351360, 2014-12-29T23:00:01.000Z)

(-46.455577, -23.505258, 549701401186938883, 2014-12-29T23:00:01.000Z)

(179.0, 81.0, 549701401191129089, 2014-12-29T23:00:01.000Z)

(-4.186111, 39.742536, 549701401203732481, 2014-12-29T23:00:01.000Z)

(12.094579, 57.928088, 549701401207930880, 2014-12-29T23:00:01.000Z)

Also, running describe locs2 gives:

locs2: {longitude: double,latitude: double,id: chararray,time_stamp: datetime}

It clearly doesn't like the filter operation on locs2, but I can't work out why.

Thanks in advance!

Tags: twitter, apache-pig

Solution
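
The post does not include the error output, so the cause of the failure cannot be confirmed from what is shown. One mismatch is visible in the script itself, though: the comment asks for longitudes between -80 and -70 and latitudes between 35 and 45, while the FILTER applies the 35 to 45 range to longitude and the -80 to -70 range to latitude. Below is a minimal sketch of the filter with the bounds matched to the comment. It assumes locs2 has the schema reported by describe and that the comment, not the expression, reflects the intended bounding box.

```pig
-- Assumption: the intended box is longitude in (-80, -70) and latitude in (35, 45),
-- as stated in the comment of the original script; locs2 is the relation built there.
geo_filtered = FILTER locs2 BY (longitude > -80.0) AND (longitude < -70.0)
                           AND (latitude > 35.0) AND (latitude < 45.0);

-- look at the top results
tops = LIMIT geo_filtered 10;
DUMP tops;
```

This only aligns the filter with its comment. Whether it also stops the job from dying cannot be told from the post; the stack trace from the failed job would be needed to narrow that down.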