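twitter - Filter by operation isn't working in Pig, not sure what's going on?
Problem description
I'm stuck trying to use Pig to extract tweets from a specific location defined by a lat-long bounding box.
I've run the following script, and it works up until I filter on latitude/longitude; then it just dies.
My script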
REGISTER 'hdfs/json-simple-1.1.jar';
REGISTER 'hdfs/elephant-bird-hadoop-compat-4.1.jar';
REGISTER 'hdfs/elephant-bird-pig-4.1.jar';
-- this is just one day, there is a bunch more data, once the script is working well
-- /data/ProjectDataset/statuses.log.2014-12-31.gz
tweets_all = LOAD '/data/ProjectDataset/statuses.log.2014-12-3*' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
-- JUST THE COORDINATES
-- to get the geo locations of tweets
tweets_all = FOREACH tweets_all GENERATE FLATTEN(json#'created_at') as time_stamp:chararray, FLATTEN(json#'id') as id:chararray, FLATTEN(json#'coordinates') as (coords_map:map[]);
-- remove duplicates
tweets = DISTINCT tweets_all;
-- filter for tweets with geo tags
filtered = FILTER tweets BY (coords_map IS NOT NULL);
-- parse the date time and unpack the geo data
locs1 = foreach filtered generate ToDate(time_stamp, 'EEE MMM dd HH:mm:ss Z yyyy') as time_stamp, coords_map#'coordinates' as coordinates:bag{t1:tuple(f1:double, f2:double)}, id as id;
-- reference longitude and latitude
locs2 = foreach locs1 generate BagToTuple(coordinates).$0 as longitude:double, BagToTuple(coordinates).$1 as latitude:double, id, time_stamp;
-- filter for tweets with geo tags with longs between (-70.0 and -80.0) and lats between (35.0 and 45.0)
geo_filtered = FILTER locs2 BY (longitude > 35) and (longitude < 45) and (latitude > -80) and (latitude < -70);
-- look at the top results
tops = limit geo_filtered 10;
dump tops;
It works up through locs2, because running tops = limit locs2 5;
and dump tops;
returns:
(-81.9536, 34.9307, 549701401182351360, 2014-12-29T23:00:01.000Z)
(-46.455577, -23.505258, 549701401186938883, 2014-12-29T23:00:01.000Z)
(179.0, 81.0, 549701401191129089, 2014-12-29T23:00:01.000Z)
(-4.186111, 39.742536, 549701401203732481, 2014-12-29T23:00:01.000Z)
(12.094579, 57.928088, 549701401207930880, 2014-12-29T23:00:01.000Z)
Also, running describe locs2
results in:
locs2: {longitude: double,latitude: double,id: chararray,time_stamp: datetime}
It clearly doesn't like the filter operation on locs2, but I can't figure out why.
Thanks in advance!
Solution
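One thing worth checking, offered as a sketch rather than a confirmed diagnosis: the bounds in the FILTER do not match the comment above it. The comment asks for longitudes between -80.0 and -70.0 and latitudes between 35.0 and 45.0, and the sample output from locs2 shows longitude in the first column (for example -81.9536) and latitude in the second. The FILTER, however, applies the 35 to 45 range to longitude and the -80 to -70 range to latitude. Assuming the comment reflects the intent, the condition could be rewritten as follows, reusing the relation and field names from the script:

```pig
-- Assumption: the intended box is longitude in (-80, -70) and latitude in (35, 45),
-- as stated in the original comment; the original FILTER had the two ranges swapped.
geo_filtered = FILTER locs2 BY (longitude > -80.0) AND (longitude < -70.0)
                           AND (latitude > 35.0) AND (latitude < 45.0);

-- look at the top results
tops = LIMIT geo_filtered 10;
DUMP tops;
```

Swapped bounds on their own would normally give an empty result rather than a failed job, so if the script still dies after this change, the error message in the job logs is the next place to look.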